Issue with critic target in PPO

philtabor / Youtube-Code-Repository

Repository for most of the code from my YouTube channel

873 stars 479 forks source link

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?

Thanks!

Hi, I just saw your comment. I think it is correct to use the GAE + Values as the target for the critic. Roughly speaking, the GAE is shown below. GAE_t + Value_t can be used as the estimation of Value in time t.

$GAE_t\approx{r_t+\gamma{V_{t+1}}-V_{t}}$

philtabor / Youtube-Code-Repository

Issue with critic target in PPO #42