Closed fuyw closed 2 years ago
In the ppo_atari example, we sample the action:
https://github.com/sail-sg/envpool/blob/ea86c2b77d12aaa58725bfeb1d701e3207f11822/examples/ppo_atari/ppo.py#L236
after we receive transitions from the train_envs:
https://github.com/sail-sg/envpool/blob/ea86c2b77d12aaa58725bfeb1d701e3207f11822/examples/ppo_atari/ppo.py#L231
Then the tuple (obs, act, rew, done, log_prob, value) is added to the batch.
However, it seems that the entries obs = o_{t+1}, act = a_{t+1}, rew = r_t, done = d_t correspond to two different timesteps.
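To make the mismatch concrete, here is a minimal sketch of the receive, act, store order described above. ToyEnv, its recv/send methods, and the fake policy are hypothetical stand-ins I made up for illustration (not the code in ppo.py); the toy reward simply equals the last action sent, so it is easy to see which row a reward lands in.

```python
class ToyEnv:
    """Hypothetical stand-in for train_envs: the reward equals the last action sent."""

    def __init__(self):
        self.t = 0
        self.last_act = None

    def recv(self):
        # The reward and done flag belong to the action sent on the previous step.
        rew = 0.0 if self.last_act is None else float(self.last_act)
        return self.t, rew, False

    def send(self, act):
        self.last_act = act
        self.t += 1


def collect(env, num_steps):
    """Receive a transition, pick an action for the new obs, store both in one row."""
    batch = []
    for _ in range(num_steps):
        obs, rew, done = env.recv()   # produced by the previous action
        act = 10 * (obs + 1)          # stand-in for sampling from the policy on obs
        batch.append((obs, act, rew, done))
        env.send(act)
    return batch


batch = collect(ToyEnv(), 4)
for row in batch:
    print(row)
```

In this toy run, the reward caused by the action in row i only shows up in row i + 1, which is the offset the question is about.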
In gae.py, the order is corrected:
https://github.com/sail-sg/envpool/blob/ea86c2b77d12aaa58725bfeb1d701e3207f11822/examples/ppo_atari/gae.py#L41-L48
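For reference, one way such a correction could look is sketched below. This is not the code in gae.py, just a plain NumPy GAE for a single environment, written under the assumption that row i stores the value of o_i together with the reward and done flag of the transition that led into o_i. The function name gae_shifted and its default gamma and lam are my own choices.

```python
import numpy as np


def gae_shifted(value, rew, done, gamma=0.99, lam=0.95):
    """GAE where rew[i + 1] and done[i + 1] belong to the action taken in row i."""
    value = np.asarray(value, dtype=np.float64)
    rew = np.asarray(rew, dtype=np.float64)
    done = np.asarray(done, dtype=np.float64)
    num_rows = len(value)
    # The last row has no following row, so it only serves as a bootstrap value.
    adv = np.zeros(num_rows - 1)
    next_adv = 0.0
    for i in range(num_rows - 2, -1, -1):
        nonterminal = 1.0 - done[i + 1]
        # Pair the action of row i with the reward stored one row later.
        delta = rew[i + 1] + gamma * value[i + 1] * nonterminal - value[i]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        adv[i] = next_adv
    return adv


print(gae_shifted([0.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0, 0, 0], gamma=1.0, lam=1.0))
```

Reading rew[i + 1] and done[i + 1] inside the loop is what undoes the one-step offset, at the cost of producing advantages for one row fewer than was collected.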
Many thanks for the explanation.