udacity / deep-reinforcement-learning

Repo for the Deep Reinforcement Learning Nanodegree program
https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
MIT License

REINFORCE Correction #42

Closed. IbrahimSobh closed this issue 2 years ago.

IbrahimSobh commented 5 years ago

Hello,

In deep-reinforcement-learning/reinforce/REINFORCE.ipynb

The return R is computed as a single scalar for the whole episode in the following code:

    # Computes only the total discounted return of the whole episode:
    discounts = [gamma**i for i in range(len(rewards)+1)]
    R = sum([a*b for a,b in zip(discounts, rewards)])
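For reference, the REINFORCE (Monte Carlo policy-gradient) update weights each action's log-probability by the discounted return from its own timestep onward:

    \nabla_\theta J(\theta) \propto \mathbb{E}_\pi\Big[ \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big], \qquad G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k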

Instead, it should be the per-timestep discounted return G_t, computed as follows:

    # For each timestep t, compute the discounted return G_t
    # over the remaining rewards of the episode:
    discounted_rewards = []
    for t in range(len(rewards)):
        Gt = 0
        pwr = 0
        for r in rewards[t:]:
            Gt = Gt + gamma**pwr * r
            pwr = pwr + 1
        discounted_rewards.append(Gt)

    # Weight each log-probability by the return from its own timestep onward:
    policy_loss = []
    for log_prob, Gt in zip(saved_log_probs, discounted_rewards):
        policy_loss.append(-log_prob * Gt)
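Since the inner loop makes this O(T^2) in the episode length, the same returns can also be computed in a single backward pass; a minimal equivalent sketch, reusing the variable names above:

    # Equivalent O(T) computation: accumulate the discounted return
    # backwards from the end of the episode.
    discounted_rewards = []
    Gt = 0
    for r in reversed(rewards):
        Gt = r + gamma * Gt
        discounted_rewards.append(Gt)
    discounted_rewards.reverse()

The resulting policy_loss terms can then be summed and backpropagated as before.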

This correction matches the REINFORCE algorithm, which weights each action's log-probability by the return from that timestep onward, and it leads to faster and more stable training, as shown in the figure below.

[figure "reinforce": training curves comparing the original and corrected implementations]

lukysummer commented 4 years ago

I agree. The expected return G_t (the sum of FUTURE rewards) should multiply each log p(At|St); G_t decreases as t increases, unlike the cumulative episode return R, which is the same for all t.
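A quick numeric check (illustrative values: three unit rewards, gamma = 0.9) makes the difference concrete:

    rewards = [1.0, 1.0, 1.0]  # illustrative episode
    gamma = 0.9

    # Single scalar computed by the original notebook code:
    R = sum(gamma**i * r for i, r in enumerate(rewards))        # 2.71

    # Per-timestep returns G_t used by the correction:
    Gs = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
          for t in range(len(rewards))]                         # [2.71, 1.9, 1.0]

Here the original code weights every log-probability by 2.71, while the correction weights the later actions by the smaller returns 1.9 and 1.0.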