vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448

Open Josh00-Lu opened 7 months ago

Josh00-Lu commented 7 months ago

Description

I have noticed that the known issues with termination and truncation handling in the PPO (Proximal Policy Optimization) algorithm have not yet been addressed, so I have added the relevant handling. Specifically, for a vector environment (vec env), when a termination or truncation is received at the current time step, the actual next state is provided in the "final_observation" entry of the infos dict, and we need to record this information. Here, done = termination or truncation.

On the other hand, when computing delta and the advantages with GAE (Generalized Advantage Estimation), delta should be masked by termination only (so a truncated step still bootstraps from the value of its final observation), while the advantage recursion should be masked by done. For more details, please refer to the code and comments in the PR (Pull Request) I have provided.
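A minimal sketch of that computation (not the PR's exact code), assuming per-step rollout buffers `rewards`, `values`, `terminations`, `dones` (termination OR truncation), and `next_values`, where `next_values[t]` holds the value of the true next state, i.e. the value of "final_observation" at a truncated or terminated step:

```python
import numpy as np

def gae(rewards, values, next_values, terminations, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with separate termination/done masks.

    All inputs are float arrays of shape (num_steps, num_envs).
    """
    advantages = np.zeros_like(rewards)
    last_gae = np.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):
        # Bootstrap only through non-terminal transitions: a truncated step
        # still bootstraps from V(final_observation) via next_values[t].
        delta = rewards[t] + gamma * next_values[t] * (1.0 - terminations[t]) - values[t]
        # Stop the advantage recursion at any episode boundary (termination OR
        # truncation), since the next sample belongs to a new episode.
        last_gae = delta + gamma * lam * (1.0 - dones[t]) * last_gae
        advantages[t] = last_gae
    return advantages, advantages + values
```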

vwxyzjn commented 5 months ago

Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Josh00-Lu commented 5 months ago

> Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Yes, especially for MuJoCo environments like Swimmer, where truncations are encountered far more often than terminations. When termination is predominant, the problem is less noticeable (though explicit handling does not hurt performance). I think this improvement is intuitive: by explicitly distinguishing termination from truncation, GAE can perform more reasonable temporal credit assignment in PPO. Admittedly, larger-scale experiments may be needed to verify this.

gmarkkula commented 5 months ago

+1 on seeing cases where it makes a big difference. CleanRL consistently performed much worse than SB3 on one of my environments, but when I added the bootstrapping from terminal observation to CleanRL it achieved the same performance as SB3. See this comment in another thread on this topic.

varadVaidya commented 5 months ago

+1 on massive improvements from the PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

Josh00-Lu commented 5 months ago

> +1 on massive improvements from the PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

Thank you. It's great to see that!

sdpkjc commented 5 months ago

Thank you for this PR! I had previously noticed the issue and made a simple modification to the CleanRL PPO by referencing the fix done in https://github.com/DLR-RM/stable-baselines3/pull/658. Here is my code: https://gist.github.com/sdpkjc/71cbba3bd439d754ef9c026c31658ecd#file-ppo_continuous_action-py-L431-L437

It only requires adding a few lines to the existing code.
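For reference, a minimal sketch of that kind of patch (my paraphrase, not the gist's exact lines): the helper name `bootstrap_truncated_rewards` is made up, and `agent.get_value`, the `rewards` buffer, `step`, and `device` are assumed to be the usual variables from ppo_continuous_action.py. Called right after `envs.step(...)`, it folds gamma * V(final_observation) into the reward of any truncated sub-environment:

```python
import torch

def bootstrap_truncated_rewards(agent, rewards, step, infos, terminations, truncations, gamma, device):
    """Add gamma * V(final_observation) to the rewards of truncated sub-envs."""
    if "final_observation" not in infos:
        return
    for idx, truncated in enumerate(truncations):
        if truncated and not terminations[idx]:
            # The auto-resetting vector env exposes the real terminal observation here.
            real_next_obs = torch.tensor(
                infos["final_observation"][idx], dtype=torch.float32, device=device
            ).unsqueeze(0)
            with torch.no_grad():
                terminal_value = agent.get_value(real_next_obs).item()
            # Folding the bootstrapped value into the reward leaves the existing
            # GAE code downstream untouched.
            rewards[step, idx] += gamma * terminal_value
```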

I found that this modification offers limited performance improvements in the default Mujoco environments, but it yields benefits in scenarios that frequently involve truncation.

I also noticed that Costa @vwxyzjn seems to have been interested in this issue previously. In his blog post https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/, he mentioned this problem with PPO. However, he opted not to fix it in his implementation for the sake of fidelity to the original algorithm.


Whether to address this issue depends on whether CleanRL prioritizes higher performance or a more faithful implementation of the original algorithm; that is a topic worth discussing. As far as I know, SB3 has addressed this issue.

Perhaps we could consider adding an argument to allow users to choose whether to handle this correctly.

vwxyzjn commented 5 months ago

Thanks all for the discussion. I had opted not to implement the correct behavior also because, at some point, I did a comprehensive evaluation and it did not seem to matter. There are also issues with implementing it efficiently: the current Gymnasium design is quite inefficient in that it requires extra forward passes whenever there are terminal states.

https://github.com/sail-sg/envpool/issues/194#issuecomment-1317171243

pseudo-rnd-thoughts commented 4 months ago

Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't going to be changed; that would spare the people above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. That means the change suggested here will no longer cause the additional forward pass.

Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update it to Gymnasium v1.0.0 within this PR or afterwards.

Josh00-Lu commented 4 months ago

> Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't going to be changed; that would spare the people above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. That means the change suggested here will no longer cause the additional forward pass.
>
> Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update it to Gymnasium v1.0.0 within this PR or afterwards.

I have reopened this PR and included the change in a new file called ppo_continuous_action_truncted.py, while keeping ppo_continuous_action.py unchanged. If the other files in this repository are updated to Gymnasium v1.0.0, I will update this file accordingly.