Closed xuehy closed 7 years ago
In train.py, the code for updating the policy loss is:
policy_loss = policy_loss - \
    log_probs[i] * Variable(gae) - 0.01 * entropies[i]
According to the original paper, I think it should be:
policy_loss = policy_loss + \
    log_probs[i] * Variable(gae) + 0.01 * entropies[i]
In the paper, the equations are written assuming gradient ascent. As we use gradient descent optimisers, the equations have to be negated.
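To make the sign convention concrete, here is a minimal sketch (not the repository's exact code) of how the GAE-based policy loss is typically accumulated in PyTorch. The tensor names mirror those quoted above, but the function name, the loop structure, and the use of `detach()` in place of the older `Variable(gae)` wrapper are illustrative assumptions.

```python
import torch

def policy_loss_from_rollout(log_probs, values, rewards, entropies,
                             gamma=0.99, tau=1.0, entropy_coef=0.01):
    """Sketch of A3C policy-loss accumulation with generalized advantage estimation.

    log_probs : list of log pi(a_t|s_t) tensors, one per step
    values    : list of V(s_t) tensors with a bootstrap value appended at the end
    rewards   : list of scalar rewards
    entropies : list of per-step policy entropy tensors

    The paper's objective (log-prob * advantage + entropy bonus) is maximised;
    returning its negative lets a standard gradient-descent optimiser minimise it.
    """
    policy_loss = 0.0
    gae = torch.zeros(1)
    for i in reversed(range(len(rewards))):
        # one-step TD error used to build the generalized advantage estimate;
        # detach() keeps critic gradients out of the actor update
        delta_t = rewards[i] + gamma * values[i + 1].detach() - values[i].detach()
        gae = gae * gamma * tau + delta_t
        # minus signs: descend on the negated objective instead of ascending on it
        policy_loss = policy_loss - log_probs[i] * gae - entropy_coef * entropies[i]
    return policy_loss
```

Minimising this negated quantity with Adam, RMSprop, or any other descent-based optimiser is equivalent to ascending the paper's original objective, which is why the minus signs in train.py are correct.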