Closed baixianger closed 1 year ago
If this step is done, the next_value should equal 0. So there are two situations:

if done=True:
    delta = r - v(s)
if done=False:
    delta = r + gamma * v(s') - v(s)
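The two cases above can be sketched in Python (a minimal illustration, not the actual CleanRL code; the function and argument names here are hypothetical):

```python
def td_delta(r, v_s, v_s_next, done, gamma=0.99):
    """TD residual for one step.

    If the episode is done at this step, the bootstrapped next value is
    masked to 0, so delta = r - v(s); otherwise the full
    delta = r + gamma * v(s') - v(s) is used.
    """
    if done:
        return r - v_s
    return r + gamma * v_s_next - v_s

# The same computation written branch-free, as PPO code usually does:
def td_delta_masked(r, v_s, v_s_next, done, gamma=0.99):
    return r + gamma * (1.0 - float(done)) * v_s_next - v_s
```

Both forms agree; the branch-free version is preferred in practice because it vectorizes over a batch of transitions.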
Hi @baixianger, thanks for raising the issue. However, `next_done` and `dones[t+1]` should be different values.

> I think the done value here should not be the next step's done. In fact, it should be the current step's done. If this step is done, the next_value should equal 0.

Could you be more specific about "this step" and "the done value here"? Which timestep is "this step"? Done values appear in multiple places; which values are you referring to?
For example, at timestep t, we should use the done value from this timestep, not the done value from the next timestep t+1. Whether the value at timestep t+1 equals 0 depends on the done from timestep t, not from timestep t+1. But I guess in your code, you use the done value from timestep t+1.
$$\delta = \text{reward} + \gamma \, (1.0 - \text{done}_{t}) \, \text{value}_{t+1} - \text{value}_{t}$$

I suppose yours is

$$\delta = \text{reward} + \gamma \, (1.0 - \text{done}_{t+1}) \, \text{value}_{t+1} - \text{value}_{t}$$
Thanks for your reply.
There might be some misalignment in the timestep index. The OpenAI/Berkeley implementations use the following notation:

https://github.com/vwxyzjn/cleanrl/blob/8a80d143ca27e6e911bb0eb8efe83704c835c025/cleanrl/ppo.py#L202-L204

s_1, d_1 executes a_1, gets r_1, s_2, d_2
s_2, d_2 executes a_2, gets r_2, s_3, d_3

Note that d_2 indicates whether a_1 resulted in the episode being done, so your second formula should be right:

$$\delta = \text{reward} + \gamma \, (1.0 - \text{done}_{t+1}) \, \text{value}_{t+1} - \text{value}_{t}$$
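Under this indexing convention, the backward GAE pass can be sketched as follows (a simplified NumPy sketch, not the exact CleanRL code; `next_value` and `next_done` are assumed to be the bootstrap value and done flag for the step just past the end of the rollout):

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Backward GAE pass. dones[t+1] masks the value at t+1, because
    under this convention d_{t+1} records whether action a_t ended the
    episode."""
    num_steps = len(rewards)
    advantages = np.zeros(num_steps)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # Last stored step: bootstrap with values from beyond the buffer.
            nextnonterminal = 1.0 - next_done
            nextvalue = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalue = values[t + 1]
        delta = rewards[t] + gamma * nextnonterminal * nextvalue - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages
```

A quick sanity check: with a two-step rollout that terminates at the end (`next_done=1`), the last advantage reduces to `r - v(s)` with no bootstrap term, matching the done=True case discussed above.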
Thank you for your patient explanation.
My pleasure! Closing the issue for now.
Problem Description
The GAE calculation in PPO continuous action
I think the done value here should not be the next step's done. In fact, it should be the current step's done.