Hi,
this is an extremely interesting question; I struggled with it myself when preparing the materials.
I agree that when using the policy gradient theorem directly, the $\gamma^t$ should be there, as it is in the algorithm in the Sutton & Barto book.
However, the term is not used in any practical implementation, as far as I know (consider for example A3C, Impala, ..., anything that solves environments with more than hundreds of steps). The problem is that if you use it, it does not work (i.e., you cannot train an Atari agent for most games when $\gamma^t$ is present with $\gamma<1$).
I comment on this issue on the slide https://ufal.mff.cuni.cz/~straka/courses/npfl122/2223/slides/?06#24 (the next-next slide after the one you referenced). The following points overlap substantially with that slide:
- With the discounted on-policy distribution, the algorithm would not be able to handle states far from the initial states, because their influence would be multiplied by $\gamma^t$.
- Therefore, we usually use a different gradient: we consider the discounted return in states coming from the non-discounted on-policy distribution (a minimal sketch of this practical update follows below this list). One would like to say that it maximizes the objective $$\mathbb{E}_{s \sim \textrm{undiscounted on-policy distribution}} [\textrm{discounted return}(s)],$$ but that is actually not true (see below).
- I first found a reference to this problem in Appendix A of https://arxiv.org/pdf/1801.01290.pdf, but it was discussed much earlier (see for example Thomas, 2014, Bias in Natural Actor-Critic Algorithms; I have not read it in detail yet).
- The paper https://arxiv.org/pdf/1906.07073.pdf shows that the "undiscounted distribution, discounted rewards" approach is not the gradient of any objective.
- In https://arxiv.org/pdf/2212.14066.pdf, it is however shown that the error of the discounted approximation is bounded by a multiple of $1 - \gamma$, so if we increase the discount factor suitably, the discounted approximation converges to the undiscounted one.
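For concreteness, here is a minimal sketch of the practical update (an illustrative PyTorch snippet, not code from any of the cited implementations; `policy`, `optimizer`, and the sampled episode are assumed to come from a surrounding training loop). The return $G_t$ is discounted from step $t$ onwards, but the update for step $t$ is not additionally weighted by $\gamma^t$:

```python
import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Discounted returns G_t = r_t + gamma * r_{t+1} + ..., computed backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # Log-probabilities of the taken actions under the current policy.
    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        torch.as_tensor(actions))

    # Practical REINFORCE: the loss weights log pi(a_t|s_t) by G_t only;
    # the policy gradient theorem would additionally multiply by gamma ** t.
    loss = -(returns * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```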
I am closing the issue, but we can continue the discussion here, if you like.
While studying the slides for my diploma thesis, I ran into a possible error, or at least a misleading formulation, in the pseudocode for REINFORCE on slide https://ufal.mff.cuni.cz/~straka/courses/npfl122/2223/slides/?06#22
There is, in my opinion, an error on the last line: the $\gamma^t$ is missing. The text below the image even says "removing $\gamma^t$ from the update of $\theta$". However, I'd argue that the update rule is then only valid in the non-discounted case, where $\gamma=1$. Let me explain. Consider the definition of the on-policy distribution for infinite-horizon trajectories (the one not given in Sutton's book, as they only define it for finite-horizon non-discounted tasks):
$$ \mu(s) = \frac{\eta(s)}{\sum_{s^\prime} \eta(s^\prime)}, $$
where
$$ \eta(s) = h(s) + \sum_{s^\prime} \eta(s^\prime) \sum_{a} \gamma \pi(a|s^\prime) p(s|s^\prime,a). $$
When I expand the recursion, I get
$$ \eta(s) = h(s) + \gamma \sum_{s^\prime, a} h(s^\prime) \pi(a|s^\prime) p(s|s^\prime,a) + \gamma^2 \sum_{s^\prime, a} \pi(a|s^\prime) p(s|s^\prime,a) \sum_{s^{\prime\prime}, a^\prime} h(s^{\prime\prime}) \pi(a^\prime|s^{\prime\prime}) p(s^\prime|s^{\prime\prime},a^\prime) + \dots, $$
so each term has the form $\gamma^t P(s_0 \rightarrow s \text{ in } t \text{ steps})$, which is exactly the quantity used in the policy gradient theorem proof.
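As a small numerical sanity check of this identity, one can iterate the recursion for $\eta$ on a toy MDP and compare it with the explicit sum $\sum_t \gamma^t P(s_0 \rightarrow s \text{ in } t \text{ steps})$; the two-state dynamics, policy, and $\gamma$ below are made-up illustrative values.

```python
import numpy as np

# A toy two-state MDP; all numbers below are illustrative assumptions.
gamma = 0.9
h = np.array([1.0, 0.0])                    # initial state distribution h(s)
pi = np.array([[0.5, 0.5], [0.2, 0.8]])     # pi[s', a] = pi(a | s')
p = np.zeros((2, 2, 2))                     # p[s', a, s] = p(s | s', a)
p[0, 0] = [0.9, 0.1]
p[0, 1] = [0.3, 0.7]
p[1, 0] = [0.4, 0.6]
p[1, 1] = [0.1, 0.9]

# State-to-state transition matrix under pi: P[s', s] = sum_a pi(a|s') p(s|s',a).
P = np.einsum('xa,xas->xs', pi, p)

# eta via the recursion eta(s) = h(s) + gamma * sum_{s'} eta(s') P[s', s].
eta = h.copy()
for _ in range(1000):
    eta = h + gamma * eta @ P

# eta via the expansion sum_t gamma^t Pr(s_0 -> s in t steps).
eta_expanded = sum(gamma ** t * (h @ np.linalg.matrix_power(P, t))
                   for t in range(1000))

print(np.allclose(eta, eta_expanded))  # prints True
```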
Now, when I compute the expectation under the on-policy distribution $\mu$ to get the policy gradient for REINFORCE, the $\gamma^t$ is not present, as it is already included in the probability $\mu(s)$. However, when I estimate the expectation by the returns calculated over a sampled trajectory, I believe I need to include the term $\gamma^t$ again, otherwise the estimate will be biased. Or is my reasoning wrong here?
To be more precise, what I mean is that it is true that
$$ \nabla_{\theta} J(\theta) \propto \mathbb{E}_{s\sim\mu} \mathbb{E}_{a\sim\pi} \left[ G_t \nabla_{\theta} \log \pi(a|s;\theta) \right], $$
but the corresponding update rule when estimating the expectations should be
$$ \theta = \theta + \alpha \gamma^t G_t \nabla_{\theta} \log \pi(a_t|s_t;\theta). $$
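For illustration, here is a minimal sketch of both per-episode losses; it is not taken from the course materials, and the `log_probs`/`rewards` inputs are assumed to be produced by surrounding sampling code. With `use_gamma_t=True` the extra $\gamma^t$ factor from the discounted on-policy distribution is included, as in the policy gradient theorem; with `use_gamma_t=False` it is the update from the slide.

```python
import torch

def reinforce_episode_loss(log_probs, rewards, gamma, use_gamma_t):
    """Per-episode REINFORCE loss; log_probs holds log pi(a_t|s_t; theta)."""
    T = len(rewards)

    # G_t = r_t + gamma * r_{t+1} + ...  (discounted from step t onwards).
    returns, G = [0.0] * T, 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    weights = torch.tensor(returns, dtype=torch.float32)

    if use_gamma_t:
        # Extra gamma^t factor from the discounted on-policy distribution.
        weights = weights * gamma ** torch.arange(T, dtype=torch.float32)

    return -(weights * log_probs).sum()
```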