philtabor / Youtube-Code-Repository

Repository for most of the code from my YouTube channel

Issue with critic target in PPO #42

Open davidireland3 opened 2 years ago

davidireland3 commented 2 years ago

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no: the target we are training towards does not seem to represent the true value function. Shouldn't the target for the value of the current state be the observed reward + the value at the next state?

Thanks!

Kin9L commented 2 years ago

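Hi, I just saw your comment. I think it is correct to use GAE + values as the target for the critic. Roughly speaking, GAE_t is an estimate of the advantage at time t, that is, how much better the (lambda-weighted) observed return was than Value_t. So GAE_t + Value_t can be used as an estimate of the value at time t.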
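To make the relationship concrete, here is a minimal sketch of GAE and the resulting critic target. It is not the repository's code: the function name `gae_and_targets`, the toy numbers, and the single-trajectory setup without done flags are all assumptions made for illustration. It uses the standard definitions delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and GAE_t = delta_t + gamma * lambda * GAE_{t+1}. With lambda = 0 the target GAE_t + Value_t reduces to exactly the one-step target from the question (reward + discounted value of the next state), and with lambda = 1 it becomes the discounted Monte Carlo return, so other lambda values interpolate between the two.

```python
import numpy as np

def gae_and_targets(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Hypothetical helper: GAE advantages and critic targets for one trajectory.

    Assumes no episode boundaries inside the trajectory (no done flags);
    last_value is the bootstrap value of the state after the final step.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)

    # V(s_{t+1}) for every step, bootstrapping the final step from last_value.
    next_values = np.append(values[1:], last_value)

    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    deltas = rewards + gamma * next_values - values

    # Backward recursion: GAE_t = delta_t + gamma * lam * GAE_{t+1}.
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running

    # Critic target discussed in this thread: GAE_t + Value_t.
    return advantages, advantages + values


# Toy data (made up for illustration).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
last_value = 0.0
gamma = 0.9

# lam = 0: target is r_t + gamma * V(s_{t+1}), the one-step TD target.
_, td_targets = gae_and_targets(rewards, values, last_value, gamma, lam=0.0)

# lam = 1: target is the discounted sum of future rewards (Monte Carlo return).
_, mc_targets = gae_and_targets(rewards, values, last_value, gamma, lam=1.0)

print(td_targets)
print(mc_targets)
```

So the one-step target from the question is a special case of GAE + values rather than a competing choice; larger lambda trades lower bias from the critic's own estimates for higher variance from the sampled rewards.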