Open davidireland3 opened 2 years ago
In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?
My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?
Thanks!
Hi, I just saw your comment. I think it is correct to use the GAE + Values as the target for the critic. Roughly speaking, the GAE is shown below. GAE_t + Value_t
can be used as the estimation of Value in time t.
In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?
My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?
Thanks!