vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Correct handling of `termination` vs `truncation`? #457

Open · ankile opened this issue 2 months ago

ankile commented 2 months ago

Hi, thank you so much for the CleanRL resource!

I have a question regarding the PPO implementation and how it handles the difference between episodes that ended because they were terminated (the task was completed) and episodes that were truncated (they ran out of time).

A comment in the advantage calculation suggests that episodes that are not done are to be bootstrapped from the value function.
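For concreteness, here is a minimal NumPy sketch of that kind of GAE loop (names and shapes are simplified, not the repo's exact code): the bootstrap term `gamma * V(s_{t+1})` is dropped whenever a step is marked done, regardless of why the episode ended.

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Sketch of a CleanRL-style GAE loop: the bootstrap term is zeroed
    out whenever `done` is set, no matter whether the episode terminated
    or was merely truncated."""
    num_steps = len(rewards)
    advantages = np.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # delta = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = (
            delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        )
    returns = advantages + values
    return advantages, returns
```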

At the same time, truncations and terminations are OR'd together, so both cases are treated as the same kind of done:

https://github.com/vwxyzjn/cleanrl/blob/8cbca61360ef98660f149e3d76762350ce613323/cleanrl/ppo_continuous_action.py#L221
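In other words (paraphrasing the linked line rather than quoting it verbatim), the two Gymnasium flags get collapsed into a single done signal before they ever reach the advantage computation:

```python
import numpy as np

# Hypothetical single-step example: with the Gymnasium step API the env
# returns separate termination and truncation flags, but OR'ing them makes
# a time-limit truncation indistinguishable from a true terminal state.
terminations = np.array([False])  # did the env reach a terminal state?
truncations = np.array([True])    # did the env hit its time limit?
next_done = np.logical_or(terminations, truncations)
print(next_done)  # [ True] -> GAE will not bootstrap through this step
```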

This seems to go against other findings and implementations, e.g. the paper *Time Limits in Reinforcement Learning* and Stable-Baselines3.

Is the difference here that you assume we're operating in environments with a real episode timeout, so that a truncation actually means failure? In other cases there is no inherent time limit, only a designer's desire for faster task solving, in which case I think it makes sense to handle truncations separately (see the sketch below).
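A minimal sketch of the kind of truncation-aware handling I mean (the helper name and signature are made up for illustration, not CleanRL or Stable-Baselines3 API): before computing advantages, add the bootstrapped value of the final observation to the reward at steps that were truncated rather than terminated.

```python
import numpy as np

def bootstrap_truncated_rewards(rewards, truncations, final_obs, value_fn,
                                gamma=0.99):
    """Hypothetical helper: where an episode was cut off by a time limit
    (truncated) rather than genuinely terminated, add gamma * V(final_obs)
    to that step's reward so the usual GAE loop effectively bootstraps
    through the time limit instead of treating it as a failure."""
    rewards = rewards.copy()
    for idx in np.flatnonzero(truncations):
        rewards[idx] += gamma * value_fn(final_obs[idx])
    return rewards
```

With this reward adjustment the combined done flag can still end the trajectory segment as before, since the missing bootstrap term has already been folded into the reward.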

Have I understood all of this correctly?

pseudo-rnd-thoughts commented 2 months ago

I believe this is being fixed here: https://github.com/vwxyzjn/cleanrl/pull/448