huggingface / deep-rl-class

This repo contains the syllabus of the Hugging Face Deep Reinforcement Learning Course.

Unit 4 policy gradient errors #285

Open dylwil3 opened 1 year ago

dylwil3 commented 1 year ago

(Updated for clarity) Apologies if I'm wrong, but it seems to me that there are some mathematical issues in Unit 4 "Diving deeper..." as well as in the optional section on the proof of the policy gradient theorem.

More major error

The main background issue seems to be an instance of the problems discussed in:

Specifically:

  1. There are two types of objective functions $J(\theta)$: you could take the sum of all rewards $r_0 + r_1 + r_2 + \cdots$, or you could take a discounted sum like $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$. The former is preferred for interpretability, but it leads to an update algorithm with $\gamma = 1$. This is problematic because the update relies on an estimate of the return following a given action, and that estimate becomes less reliable the further we move from the action taken. If we instead used the latter objective function, we would obtain the update given in Sutton and Barto, page 328, which does not agree with the REINFORCE algorithm used in practice because of the additional factor of $\gamma^t$ appearing in the update (see the comparison written out after this list).
  2. In practice, the algorithm used is the one implemented in your course and, e.g., in Stable Baselines. But it is worth noting that it does not update by the gradient of any objective function; that is the result in the first linked paper above. The second linked paper explains that, in general, we can only expect the REINFORCE algorithm to be stable and converge to the intended optimum if we use a schedule for both the learning rate and $\gamma$: the learning rate must decay to zero (roughly harmonically) while $\gamma$ moves towards $1$ (at a rate controlled by the learning rate). See Theorem 2 in loc. cit. I haven't seen anyone actually implement that algorithm anywhere... maybe it's not worth it for the sake of this course.
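
To make item 1 concrete, here is the discrepancy written out (my notation; $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ is the discounted return from step $t$). The update obtained from the gradient of the discounted objective (Sutton and Barto, p. 328) carries an extra $\gamma^t$:

$$\theta \leftarrow \theta + \alpha \, \gamma^{t} \, G_t \, \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$$

whereas the update used in practice (the course, Stable Baselines) drops that factor, and is therefore not the gradient of either objective:

$$\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \ln \pi_\theta(a_t \mid s_t)$$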

I am not sure how you'd like to proceed. One possibility would be to carry out the entire discussion with $\gamma = 1$, so that there are no mathematical errors, and then, when implementing the algorithm, say something like: "We replace the return $G_t$ with the discounted version to account for the variance in estimating rewards at later time steps."
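
For reference, here is a minimal sketch of the update that is implemented in practice, written in PyTorch since that is what the course notebooks use. The function name `reinforce_loss` and its arguments are mine, chosen for illustration, not taken from the course code:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Practical REINFORCE loss: discounted returns G_t, but no gamma**t
    factor in front of each term (unlike Sutton and Barto, p. 328).

    log_probs: list of scalar tensors log pi_theta(a_t | s_t) from one episode
    rewards:   list of floats, the rewards received after each action
    """
    # Compute discounted returns G_t by sweeping backwards through the episode.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Minimizing this loss performs the "REINFORCE as implemented" update;
    # per the papers above, it is not the gradient of either objective.
    return -(torch.stack(log_probs) * returns).sum()
```

Multiplying each term by $\gamma^t$ before summing would recover the textbook update for the discounted objective, but, as noted above, that is not what implementations do.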

Specific, smaller errors

simoninithomas commented 1 year ago

Hey there 👋 thanks for the issue. I'm adding it to the to-do list for the next big update (next week) 🤗