vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

The GAE calculation in PPO continuous action #393

Closed. baixianger closed this issue 1 year ago.

baixianger commented 1 year ago

Problem Description

The GAE calculation in PPO continuous action

[screenshot: the GAE calculation in ppo_continuous_action.py]

I think the done value here should not be the next step's done. In fact, it should be the current step's done.

baixianger commented 1 year ago

If this step is done, next_value should equal 0. So there are two situations:

if done:
    delta = r - v(s)
else:
    delta = r + gamma * v(s') - v(s)
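
Written as a single expression, the two cases collapse into one line when done is treated as a 0/1 mask (a sketch; the names r, v, v_next, gamma are illustrative, not taken from the repo):

# done gates the bootstrap from the next state's value; the question in this
# issue is which timestep's flag should be used as the mask.
delta = r + gamma * (1.0 - float(done)) * v_next - v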
vwxyzjn commented 1 year ago

Hi @baixianger, thanks for raising the issue. However, next_done and dones[t+1] should be different values.

> I think the done value here should not be the next step's done. In fact, it should be the current step's done. If this step is done, next_value should equal 0.

Could you be more specific about "this step" and "the done value here"? Which timestep is "this step"? Done values appear in multiple places; which ones are you referring to?

baixianger commented 1 year ago

> Hi @baixianger, thanks for raising the issue. However, next_done and dones[t+1] should be different values.
>
> > I think the done value here should not be the next step's done. In fact, it should be the current step's done. If this step is done, next_value should equal 0.
>
> Could you be more specific about "this step" and "the done value here"? Which timestep is "this step"? Done values appear in multiple places; which ones are you referring to?

For example: at timestep t, we should use the done value from this timestep, not the done value from the next timestep t+1. Whether the value at timestep t+1 is set to 0 depends on the done from timestep t, not on the done from timestep t+1. But I guess in your code, you use the done value from timestep t+1.

\delta = \text{reward} + \gamma * (1.0 - \text{done}_{t}) * \text{value}_{t+1} - \text{value}_{t}

I suppose yours is

\delta = \text{reward} + \gamma * (1.0 - \text{done}_{t+1}) * \text{value}_{t+1} - \text{value}_{t}

Thanks for your reply.

vwxyzjn commented 1 year ago

There might be some misalignment on the timestep index. The OpenAI/Berkeley convention uses the following notation:

https://github.com/vwxyzjn/cleanrl/blob/8a80d143ca27e6e911bb0eb8efe83704c835c025/cleanrl/ppo.py#L202-L204
s_1, d_1: execute a_1, get r_1, s_2, d_2
s_2, d_2: execute a_2, get r_2, s_3, d_3

Note that d_2 indicates whether a_1 resulted in the episode being done, so your second formula should be right.

\delta = \text{reward} + \gamma * (1.0 - \text{done}_{t+1}) * \text{value}_{t+1} - \text{value}_{t}
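
For reference, here is a minimal sketch of the backward GAE pass in the spirit of the code linked above (tensor shapes and variable names are approximations, not a verbatim excerpt from ppo.py). Under this convention, dones[t + 1] and next_done are exactly the flags returned by env.step when a_t was executed, so they are the correct masks for the bootstrap term value_{t+1}.

import torch

# A hedged sketch, not the repo's exact code: rollout tensors are assumed to have
# shape (num_steps, num_envs), where dones[t] is what was observed *before* acting
# at step t; next_value / next_done describe the state observed after the last
# stored step.
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # next_done is the flag returned by env.step for the last stored action,
            # i.e. it says whether a_t ended the episode.
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            # dones[t + 1] plays the same role for earlier steps: it is the "done"
            # returned when a_t was executed, so it gates the bootstrap V(s_{t+1}).
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
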
baixianger commented 1 year ago

> There might be some misalignment on the timestep index. The OpenAI/Berkeley convention uses the following notation:
>
> https://github.com/vwxyzjn/cleanrl/blob/8a80d143ca27e6e911bb0eb8efe83704c835c025/cleanrl/ppo.py#L202-L204
>
> s_1, d_1: execute a_1, get r_1, s_2, d_2
> s_2, d_2: execute a_2, get r_2, s_3, d_3
>
> Note that d_2 indicates whether a_1 resulted in the episode being done, so your second formula should be right.

Thank you for your patient explanation.

vwxyzjn commented 1 year ago

My pleasure! Closing the issue for now.