ufal / npfl122

NPFL122 repository

[06/slides] Wrong/misleading pseudocode for REINFORCE #117

Closed Fassty closed 1 year ago

Fassty commented 1 year ago

While studying the slides for my diploma thesis, I ran into a possible error (or at least a misleading formulation) in the REINFORCE pseudocode on slide https://ufal.mff.cuni.cz/~straka/courses/npfl122/2223/slides/?06#22

[Screenshot of the REINFORCE pseudocode from the slide]

In my opinion there is an error on the last line: the $\gamma^t$ factor is missing. The slide even says below the pseudocode that it is obtained by "removing $\gamma^t$ from the update of $\theta$". However, I'd argue that the update rule is then only valid in the non-discounted case, where $\gamma = 1$. Let me explain.
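For concreteness, here is a minimal sketch (my own, not taken from the slides) of the two variants of the update, assuming a tabular softmax policy with parameters `theta[s, a]`; the helper name and the `discount_update` flag are just for illustration:

```python
import numpy as np

def grad_log_softmax_policy(theta, s, a):
    """Gradient of log pi(a|s; theta) for a tabular softmax policy theta[s, a]."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad = np.zeros_like(theta)
    grad[s] = -probs
    grad[s, a] += 1.0
    return grad

def reinforce_update(theta, trajectory, alpha, gamma, discount_update=True):
    """One REINFORCE update from a sampled trajectory [(s, a, r), ...]."""
    rewards = [r for _, _, r in trajectory]
    for t, (s, a, _) in enumerate(trajectory):
        # Discounted return G_t from time step t onwards.
        G_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        # The slide's update omits the gamma**t factor below; I argue it is
        # needed whenever gamma < 1 for the sampled update to match the
        # gradient of the discounted objective.
        scale = gamma ** t if discount_update else 1.0
        theta = theta + alpha * scale * G_t * grad_log_softmax_policy(theta, s, a)
    return theta
```

With `discount_update=False` this is the update from the slide; with `discount_update=True` it is the one I believe matches the discounted objective.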

Consider the definition of the on-policy distribution for infinite-horizon discounted trajectories (the one not given in Sutton and Barto's book, where it is only defined for finite-horizon non-discounted tasks).

$$ \mu(s) = \frac{\eta(s)}{\sum_{s^\prime} \eta(s^\prime)}, $$

where

$$ \eta(s) = h(s) + \sum_{s^\prime} \eta(s^\prime) \sum_{a} \gamma \pi(a|s^\prime) p(s|s^\prime,a). $$
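As a sanity check of this definition (again my own sketch, with hypothetical array shapes: `h[s]` for the initial distribution, `pi[s, a]` for the policy, and `P[s, a, s_next] = p(s_next | s, a)` for the transitions), the recursion can be iterated to a fixed point for $\gamma < 1$:

```python
import numpy as np

def on_policy_distribution(h, pi, P, gamma, iterations=10_000):
    """Fixed-point iteration of eta(s) = h(s) + gamma * sum_{s',a} eta(s') pi(a|s') p(s|s',a)."""
    eta = h.copy()
    for _ in range(iterations):
        # Sum over source state s' (index i) and action a of eta(s') pi(a|s') p(s|s',a).
        eta = h + gamma * np.einsum("i,ia,iaj->j", eta, pi, P)
    mu = eta / eta.sum()  # normalize to obtain mu(s)
    return eta, mu
```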

When I expand the recursion I get

$$ \eta(s) = h(s) + \gamma \sum_{s^\prime, a} h(s^\prime) \pi(a|s^\prime) p(s|s^\prime,a) + \gamma^2 \sum_{s^\prime, a} \pi(a|s^\prime) p(s|s^\prime,a) \sum_{s^{\prime\prime}, a^\prime} h(s^{\prime\prime}) \pi(a^\prime|s^{\prime\prime}) p(s^\prime|s^{\prime\prime},a^\prime) + \dots, $$

so I get the terms $\gamma^t P(s_0 \rightarrow s \text{ in } t \text{ steps})$ that are then used in the proof of the policy gradient theorem.

Now, when I compute the expectation under the on-policy distribution $\mu$ to get the policy gradient for REINFORCE, the $\gamma^t$ is not present, because it is already included in the probability $\mu(s)$. However, when I estimate that expectation by returns computed over a sampled trajectory, I believe I need to include the $\gamma^t$ term again, otherwise the estimate will be biased. Or is my reasoning wrong here?

To be more concrete, what I mean is that it is true that

$$ \nabla_\theta J(\theta) \propto E_{s\sim\mu} E_{a\sim\pi} \left[ G_t \nabla_\theta \log \pi(a|s;\theta) \right], $$

but the corresponding update rule when estimating the expectations should be

$$ \theta = \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi(a_t|s_t;\theta). $$
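To spell out the step that leads me to this (my own reasoning, following the standard proof of the policy gradient theorem):

$$
\begin{aligned}
\nabla_\theta J(\theta)
&\propto \sum_s \eta(s) \sum_a \nabla_\theta \pi(a|s;\theta)\, q_\pi(s,a) \\
&= \sum_{t=0}^\infty \sum_s \gamma^t P(s_0 \rightarrow s \text{ in } t \text{ steps}) \sum_a \pi(a|s;\theta)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s;\theta) \\
&= E_\pi \left[ \sum_{t=0}^\infty \gamma^t\, q_\pi(s_t, a_t)\, \nabla_\theta \log \pi(a_t|s_t;\theta) \right]
= E_\pi \left[ \sum_{t=0}^\infty \gamma^t\, G_t\, \nabla_\theta \log \pi(a_t|s_t;\theta) \right],
\end{aligned}
$$

which is where the extra $\gamma^t$ per time step comes from once the expectation over $\mu$ is replaced by a sum over a sampled trajectory.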

foxik commented 1 year ago

Hi,

this is an extremely interesting question; I struggled with it myself when preparing the materials.

I am closing the issue, but we can continue the discussion here, if you like.