The use of 'discounts' in REINFORCE() class

This is just an enquiry about the REINFORCE() class in chapter 11.

    class REINFORCE():
        ......
        def optimize_model(self):
            T = len(self.rewards)
            discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
            returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

            discounts = torch.FloatTensor(discounts).unsqueeze(1)
            returns = torch.FloatTensor(returns).unsqueeze(1)
            self.logpas = torch.cat(self.logpas)

            policy_loss = -(discounts * returns * self.logpas).mean()

In the code above, 'returns' already incorporates 'discounts': each entry is computed as a discounted sum of the rewards that follow it. So why is it multiplied by 'discounts' a second time when computing 'policy_loss'? I am not clear on this.
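
To make the two uses of 'discounts' concrete, here is a minimal sketch in plain Python (no NumPy or PyTorch, so it runs on its own). The episode below is a made-up toy example, not taken from the book or the repo. It reproduces the same arithmetic as the snippet above and prints the per-step weight that multiplies each log-probability in 'policy_loss'.

```python
# Hypothetical toy episode; these numbers are illustrative only.
gamma = 0.5
rewards = [1.0, 1.0, 1.0]
T = len(rewards)

# Same values as np.logspace(0, T, num=T, base=gamma, endpoint=False):
# gamma**0, gamma**1, ..., gamma**(T-1).
discounts = [gamma ** k for k in range(T)]

# returns[t] is the discounted return measured from step t onward:
# G_t = sum over k of gamma**k * r_{t+k}. Discounting restarts at each t.
returns = [
    sum(d * r for d, r in zip(discounts[:T - t], rewards[t:]))
    for t in range(T)
]

# The loss weights each log-probability by discounts[t] * returns[t],
# i.e. gamma**t * G_t: the return from step t, discounted back to step 0.
weights = [d * g for d, g in zip(discounts, returns)]

print(discounts)  # [1.0, 0.5, 0.25]
print(returns)    # [1.75, 1.5, 1.0]
print(weights)    # [1.75, 0.75, 0.25]
```

Seen this way, the two factors appear to play different roles: inside 'returns', the discount is relative to step t, while the second multiplication by discounts[t] = gamma**t scales each step by its distance from the start of the episode. A gamma**t factor of this kind also appears in the REINFORCE pseudocode for the discounted case in Sutton and Barto's textbook. Whether that is the reason it is used here is an assumption, not something the code itself confirms.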
