ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
MIT License

Why is the policy loss negative? #30

Closed: xuehy closed this issue 7 years ago

xuehy commented 7 years ago

In train.py, the code for updating the policy loss is

 policy_loss = policy_loss - \
                log_probs[i] * Variable(gae) - 0.01 * entropies[i]

According to the original paper, I think it should be

 policy_loss = policy_loss + \
                log_probs[i] * Variable(gae) + 0.01 * entropies[i]
Kaixhin commented 7 years ago

In the paper, the equations are written assuming gradient ascent. As we use gradient descent optimisers, the equations have to be negated.
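To make the sign flip concrete, here is a minimal sketch (not the actual train.py loop; the scalar values and the names `log_prob`, `advantage`, `entropy`, and `beta` are illustrative) showing that minimising the negated objective gives a descent optimiser the same update direction that gradient ascent on the paper's objective would take:

    import torch

    # Paper (gradient ascent):     maximise   log_prob * advantage + beta * entropy
    # train.py (gradient descent): minimise -(log_prob * advantage + beta * entropy)
    #                                       = -log_prob * advantage - beta * entropy

    log_prob = torch.tensor(-1.2, requires_grad=True)   # log pi(a|s), illustrative value
    entropy = torch.tensor(0.8, requires_grad=True)      # policy entropy, illustrative value
    advantage = torch.tensor(0.5)                        # GAE estimate, treated as a constant
    beta = 0.01                                          # entropy coefficient (0.01 in train.py)

    objective = log_prob * advantage + beta * entropy    # what the paper ascends on
    policy_loss = -objective                             # what a descent optimiser minimises

    policy_loss.backward()
    print(log_prob.grad)  # tensor(-0.5000): the negated ascent direction
    print(entropy.grad)   # tensor(-0.0100)

The gradient of the loss with respect to `log_prob` is `-advantage`, the negative of the ascent gradient, so stepping downhill on the negated loss moves the parameters exactly as an ascent step on the paper's objective would.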