openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

The compute_loss function is wrong for the Simplest Policy Gradient #414

Open alantpetrescu opened 3 months ago

alantpetrescu commented 3 months ago

I have been reading the three parts of the "Introduction to RL" section, and I noticed in Part 3 that the `compute_loss` function for the Simplest Policy Gradient returns the mean of the products of the log probabilities of the actions taken by the agent and the weights of those actions, i.e. the finite-horizon undiscounted returns of the episodes in which those actions were taken.

[Image: the sample-based policy gradient estimate from Part 3, $\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$]

In the estimate of the basic policy gradient above, the sum of products is divided by the number of trajectories, but in the implementation, when you return the mean, the sum of products is divided by the total number of actions taken across all the trajectories in one epoch. Maybe I am understanding this wrong, but I wanted to get a clear picture of the implementation.

[Image: the `compute_loss` function from the Simplest Policy Gradient implementation, which returns `-(logp * weights).mean()`]
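To make the two normalizations concrete, here is a minimal sketch (not the Spinning Up code itself; the tensor sizes and the names `logp`, `weights`, and `num_trajectories` are illustrative assumptions) comparing the equation's division by the number of trajectories $|\mathcal{D}|$ with the implementation's mean over every timestep in the batch:

```python
import torch

# Illustrative batch from one epoch: logp[t] is log pi(a_t | s_t) for every
# timestep collected, and weights[t] is R(tau) for the episode containing t.
logp = torch.randn(500, requires_grad=True)   # 500 timesteps in the batch (assumed)
weights = torch.randn(500)                    # finite-horizon undiscounted returns
num_trajectories = 10                         # |D|, number of episodes in the batch (assumed)

# Normalization in the written estimate: divide the summed products by |D|.
loss_eq = -(logp * weights).sum() / num_trajectories

# Normalization in compute_loss: mean over all timesteps, i.e. divide by the
# total number of actions taken across all trajectories in the epoch.
loss_impl = -(logp * weights).mean()

# The two losses, and therefore their gradients, differ only by the factor
# (total timesteps) / |D|, the average episode length of this batch.
print(loss_eq / loss_impl)   # ~ 500 / 10 = 50
```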

hirodeng commented 2 months ago

I noticed the same problem.

earnesdm commented 1 week ago

@alantpetrescu I think you are correct that the equation as written differs from what is implemented in the code, but only by a constant multiple. Since the gradient estimate is multiplied by the learning rate when performing gradient ascent, the constant multiple doesn't really matter.
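To spell out the standard argument (my own sketch, not text from the thread): writing $\hat{J}_{\text{eq}}$ for the objective as written and $\hat{J}_{\text{impl}} = \hat{J}_{\text{eq}} / c$ for the implemented one, where $c > 0$ is the constant factor (here the total number of timesteps in the batch divided by $|\mathcal{D}|$), a gradient-ascent step on $\hat{J}_{\text{impl}}$ with learning rate $\alpha$ is

$$
\theta_{k+1} = \theta_k + \alpha \nabla_\theta \hat{J}_{\text{impl}}(\theta_k)
             = \theta_k + \frac{\alpha}{c}\, \nabla_\theta \hat{J}_{\text{eq}}(\theta_k),
$$

which is exactly a step on $\hat{J}_{\text{eq}}$ with learning rate $\alpha / c$, so the two normalizations generate the same family of updates and differ only in the effective step size.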