mimoralea / gdrl

Grokking Deep Reinforcement Learning
https://www.manning.com/books/grokking-deep-reinforcement-learning
BSD 3-Clause "New" or "Revised" License

The use of 'discounts' in REINFORCE() class #38

Open cwk20 opened 6 months ago

cwk20 commented 6 months ago

This is just an enquiry about the REINFORCE() class in chapter 11.

    class REINFORCE():
        ......
        def optimize_model(self):
            T = len(self.rewards)
            discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
            returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

            discounts = torch.FloatTensor(discounts).unsqueeze(1)
            returns = torch.FloatTensor(returns).unsqueeze(1)
            self.logpas = torch.cat(self.logpas)

            policy_loss = -(discounts * returns * self.logpas).mean()

In the code above, 'returns' already takes 'discounts' into account. So why is it multiplied by 'discounts' again when computing 'policy_loss'? I am not clear on this.
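
A small numeric sketch may help separate the two roles of 'discounts'. It assumes a toy three-step episode with made-up values for gamma and rewards, and it uses plain Python lists in place of NumPy and PyTorch. Inside 'returns', the discount is applied relative to step t, while the extra factor in 'policy_loss' is gamma**t, which weights each step by its distance from the start of the episode. That reading matches the gamma^t * G_t form of the REINFORCE update in Sutton and Barto's textbook; whether it is the book author's intent here is an assumption.

```python
# Toy episode: gamma and rewards are made-up values for illustration only.
gamma = 0.5
rewards = [1.0, 1.0, 1.0]
T = len(rewards)

# Same values as np.logspace(0, T, num=T, base=gamma, endpoint=False):
# gamma**0, gamma**1, ..., gamma**(T-1).
discounts = [gamma ** k for k in range(T)]

# returns[t] is the return from step t onward, discounted relative to step t:
# G_t = sum over k of gamma**k * rewards[t + k].
returns = [
    sum(d * r for d, r in zip(discounts[:T - t], rewards[t:]))
    for t in range(T)
]

# The extra factor in the loss is gamma**t, applied once per time step.
# These are the weights that multiply each log-probability.
weights = [discounts[t] * returns[t] for t in range(T)]

print(discounts)  # [1.0, 0.5, 0.25]
print(returns)    # [1.75, 1.5, 1.0]
print(weights)    # [1.75, 0.75, 0.25]
```

In this sketch the first discount measures reward delay from step t and the second measures the delay of step t from step 0, so each plays a different role. Dropping the second factor would give the undiscounted-weighting variant that many implementations use in practice.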