thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Episode start signal not used in RNN for on-policy algorithms #486

Open · araffin opened this issue 2 years ago

araffin commented 2 years ago

Hello,

It seems that episode start signals are currently not used when working with RNNs and on-policy algorithms (the hidden state must be reset whenever such a signal is encountered). This may explain the poor performance reported in https://github.com/thu-ml/tianshou/issues/470#issuecomment-968737355.

More context:
- SB3-Contrib implementation: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/53
- CleanRL implementation (@vwxyzjn): https://github.com/vwxyzjn/cleanrl/pull/83
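
For reference, the core of what those implementations do is a per-timestep mask on the recurrent state during the update, so that the computation graph matches the rollout. A minimal sketch (not tianshou code; the GRU, shapes, and names below are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative shapes and names (not tianshou's): T timesteps, B parallel envs.
gru = nn.GRU(input_size=8, hidden_size=32)

def recurrent_forward(obs_seq: torch.Tensor, episode_starts: torch.Tensor,
                      h0: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """obs_seq: (T, B, 8); episode_starts: (T, B), 1.0 where a new episode begins;
    h0: (1, B, 32), the hidden state carried over from the previous chunk."""
    h, outputs = h0, []
    for t in range(obs_seq.shape[0]):
        # Zero the hidden state of every env whose episode restarts at this step,
        # so the update sees the same recurrent states as the rollout did.
        h = h * (1.0 - episode_starts[t]).view(1, -1, 1)
        out, h = gru(obs_seq[t:t + 1], h)
        outputs.append(out)
    return torch.cat(outputs), h
```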

Trinkle23897 commented 2 years ago

Thanks! I'll take a look once I finish my database assignment...

Trinkle23897 commented 2 years ago

It seems that episode start signals are currently not used (hidden state must be reset when such signal is encountered) when working with RNN and on-policy algorithms.

I think the current implementation already does this: https://github.com/thu-ml/tianshou/blob/3592f45446e6cc98423df2f1c28d8ca0ef2be821/tianshou/data/collector.py#L281-L282

Please correct me if I'm wrong...

araffin commented 2 years ago

I think the current implementation has already done this:

This is for data collection only; the reset should also be done when updating the networks.

Trinkle23897 commented 2 years ago

But in the training phase, the state is set to None: https://github.com/thu-ml/tianshou/blob/3592f45446e6cc98423df2f1c28d8ca0ef2be821/tianshou/utils/net/common.py#L256-L269

vwxyzjn commented 2 years ago

Maybe @araffin is saying that the states could be reset multiple times during training, depending on the number of dones in the collected data?

Trinkle23897 commented 2 years ago

But the reset_state operation only applies to the env_ids with done=True, like:

env_id 0 1 2
done   T F T
state  0 / 0
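
In code, that per-env reset at collection time amounts to something like the following sketch (illustrative names and shapes, not the collector's actual variables):

```python
import numpy as np
import torch

def reset_finished_envs(hidden: torch.Tensor, done: np.ndarray) -> torch.Tensor:
    """hidden: (num_layers, num_envs, hidden_size) state carried across env.step() calls;
    done: (num_envs,) booleans for the last step of the vectorized env."""
    mask = torch.as_tensor(done, dtype=torch.bool, device=hidden.device)
    hidden = hidden.clone()
    # Zero only the envs whose episode just ended; the others keep their state.
    hidden[:, mask] = 0.0
    return hidden
```
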
vwxyzjn commented 2 years ago

Right, but you should apply the same principle during training. A quick way to check whether the implementation is correct is to print out the probability ratio: if, during the first epoch and first minibatch, the ratios are not all exactly 1, the implementation is incorrect, because you have not reproduced the computational graph used in the rollouts.
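
Concretely, the check could look like this sketch, assuming the old log-probabilities saved at rollout time are available (names are illustrative):

```python
import torch

def check_first_minibatch_ratio(new_logprob: torch.Tensor, old_logprob: torch.Tensor) -> None:
    """new_logprob: log pi(a|s) recomputed during the update (with rebuilt hidden states);
    old_logprob: log pi(a|s) stored at rollout time."""
    ratio = torch.exp(new_logprob - old_logprob)
    # Before any gradient step the policy is unchanged, so if the recurrent states
    # were rebuilt exactly as in the rollout, every ratio must be exactly 1.
    assert torch.allclose(ratio, torch.ones_like(ratio)), ratio
```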

Trinkle23897 commented 2 years ago

Do you mean that in the sampled trajectory there could be done=True in the middle?

Actually, this never happens in the current implementation: only the last timestep may have done=True when calling buffer.sample(). As for the corner case where every episode has length 3 but we want to sample trajectories of length 4:

index: 0 1 2 0 1 2 0 1 2 ...
done:  F F T F F T F F T ...

samples from the above (all possible cases with the current implementation):
0 0 1 2
0 0 0 1
0 0 0 0
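
The padding above corresponds to clamping the stacked indices at the episode start; a quick illustrative sketch (not the buffer's actual code):

```python
def stacked_indices(t: int, stack_len: int, episode_start: int) -> list[int]:
    """Indices of a stacked sample ending at timestep t, clamped at the episode start."""
    return [max(i, episode_start) for i in range(t - stack_len + 1, t + 1)]

# Episodes of length 3 (timesteps 0, 1, 2), trajectories of length 4:
print(stacked_indices(2, 4, 0))  # [0, 0, 1, 2]
print(stacked_indices(1, 4, 0))  # [0, 0, 0, 1]
print(stacked_indices(0, 4, 0))  # [0, 0, 0, 0]
```
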
vwxyzjn commented 2 years ago

Do you mean in the sample trajectory, there would be done=True in the middle?

yes.

Actually this is never happening in the current implementation

This seems unusual to me and does not match the standard implementation… it also does not appear correct (when 0 0 1 2 is fed to the agent, state 0 is fed twice?)

Trinkle23897 commented 2 years ago

when 0 0 1 2 is fed to the agent, state 0 is fed twice?

yes...

This seems unusual to me and does not match the standard implementation…

I'll do the refactor in the following week.

vwxyzjn commented 2 years ago

yes...

yeah… This might cause some problems.

I'll do the refactor in the following week.

If you are looking for a simpler yet faithful reference than openai/baselines, check out https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py (a 321-line standalone implementation) and its corresponding tutorial: https://youtu.be/MEt6rrxH8W4.