This is just an enquiry about REINFORCE() class in chapter 11.
class REINFORCE():
......
def optimize_model(self):
T = len(self.rewards)
discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])
In the code above, 'returns' already take into consideration 'discounts'. So, why do we multiply by another 'discounts' when working out 'policy_loss'? I am not clear on this.
This is just an enquiry about REINFORCE() class in chapter 11.
class REINFORCE(): ...... def optimize_model(self): T = len(self.rewards) discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False) returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])
In the code above, 'returns' already take into consideration 'discounts'. So, why do we multiply by another 'discounts' when working out 'policy_loss'? I am not clear on this.