[BUG] Transition pair misalignment in the `ppo_atari` example

fuyw commented 2 years ago

Describe the bug

In the ppo_atari example, we sample the action:

after we receive transitions from the train_envs:

Then the tuple (obs, act, rew, done, log_prob, value) is added to the batch.

However, it seems that the stored tuple mixes two different timesteps: obs = o_{t+1} and act = a_{t+1} belong to step t+1, while rew = r_t and done = d_t belong to step t.
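To make the offset concrete, here is a minimal toy sketch of the collection order described above. It is not the code from the example: `collect` is a hypothetical helper that records only the timestep index of each stored field, assuming the env step for action a_t returns (o_{t+1}, r_t, d_t) and the next action is sampled from o_{t+1} before the tuple is stored.

```python
def collect(num_steps):
    """Toy rollout that stores timestep indices instead of real data.

    Hypothetical helper for illustration only, not the example's code.
    """
    buffer = []
    for t in range(num_steps):
        # The env step for action a_t returns o_{t+1}, r_t and d_t.
        obs_idx, rew_idx, done_idx = t + 1, t, t
        # The next action is sampled from o_{t+1}, so it is a_{t+1}.
        act_idx = obs_idx
        buffer.append(
            {"obs": obs_idx, "act": act_idx, "rew": rew_idx, "done": done_idx}
        )
    return buffer


for row in collect(3):
    print(row)
```

In every stored row, obs and act carry an index one larger than rew and done, which is the misalignment in question.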

[x] I have checked that there is no similar issue in the repo (required)
[x] I have read the documentation (required)
[x] I have provided a minimal working example to reproduce the bug (required)

Trinkle23897 commented 2 years ago

> correspond to two different timestamps.

In gae.py it corrects the order:
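The following is not the code from gae.py. It is a minimal NumPy sketch of how a GAE computation could undo the offset, assuming a buffer where `value[i]` belongs to step i while `rew[i]` and `done[i]` belong to step i-1. The function name `gae_shifted` and the default `gamma` and `lam` values are hypothetical.

```python
import numpy as np


def gae_shifted(rew, done, value, gamma=0.99, lam=0.95):
    """GAE over a buffer whose rew/done lag value by one step.

    Assumed layout (illustrative, not taken from gae.py):
      value[i]         -> V(o_i)
      rew[i], done[i]  -> r_{i-1}, d_{i-1}
    All arrays have length T + 1; advantages are returned for steps 0..T-1.
    """
    T = len(value) - 1
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # Read reward and done one slot ahead to recover r_t and d_t.
        r_t = rew[t + 1]
        nonterminal = 1.0 - done[t + 1]
        delta = r_t + gamma * value[t + 1] * nonterminal - value[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv


# Three steps with reward 1 each; the episode ends after the third step.
rew = np.array([0.0, 1.0, 1.0, 1.0])   # rew[0] is a placeholder
done = np.array([0.0, 0.0, 0.0, 1.0])  # done[0] is a placeholder
value = np.full(4, 0.5)
print(gae_shifted(rew, done, value, gamma=1.0, lam=1.0))  # [2.5 1.5 0.5]
```

Reading `rew[t + 1]` and `done[t + 1]` pairs each reward and done flag with the observation and action that produced them, so the advantages match what an aligned buffer would give.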

fuyw commented 2 years ago

Many thanks for the explanation.