In deep-reinforcement-learning/reinforce/REINFORCE.ipynb
R is computed as a single scalar for the whole episode in the following code:
discounts = [gamma**i for i in range(len(rewards)+1)]
R = sum([a*b for a,b in zip(discounts, rewards)])
However, it should be computed as per-timestep discounted returns (the return-to-go from each step t), as follows:
discounted_rewards = []
for t in range(len(rewards)):
    Gt = 0
    pwr = 0
    for r in rewards[t:]:              # only the rewards from step t onward
        Gt = Gt + gamma**pwr * r
        pwr = pwr + 1
    discounted_rewards.append(Gt)      # G_t: discounted return-to-go from step t

policy_loss = []
for log_prob, Gt in zip(saved_log_probs, discounted_rewards):
    policy_loss.append(-log_prob * Gt)
This correction is consistent with the REINFORCE algorithm and leads to faster and more stable training, as shown in the figure.
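As a side note, the double loop above is O(T^2) in the episode length; the same returns-to-go can be computed in a single backward pass using G_t = r_t + gamma * G_{t+1}. A minimal sketch (the variable names match the snippet above; the rest is my own):
# Same result as the double loop, but O(T).
# Assumes rewards and gamma are defined as in the notebook.
discounted_rewards = []
Gt = 0.0
for r in reversed(rewards):            # walk the episode backwards
    Gt = r + gamma * Gt                # G_t = r_t + gamma * G_{t+1}
    discounted_rewards.append(Gt)
discounted_rewards.reverse()           # restore time order t = 0 .. T-1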
I agree. The expected return G_t (the sum of FUTURE rewards), which shrinks as t increases, should be multiplied with each log p(At|St), not the cumulative episode reward R, which is the same for all t.
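To make that concrete, a toy check (the numbers are made up for illustration, not from the notebook):
# Arbitrary example values.
gamma = 0.9
rewards = [1.0, 2.0, 3.0]

# Single scalar R as in the notebook -- identical for every t:
R = sum(gamma**i * r for i, r in enumerate(rewards))   # 1 + 1.8 + 2.43 = 5.23

# Per-timestep returns-to-go G_t -- they shrink as t grows:
G = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
     for t in range(len(rewards))]                     # [5.23, 4.7, 3.0]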