vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448

Open Josh00-Lu opened 7 months ago

Josh00-Lu commented 7 months ago

Description

I have noticed that the known issues with termination and truncation handling in the PPO (Proximal Policy Optimization) algorithm have not yet been addressed, so I have added the relevant handling. Specifically, for a vector environment (vec env), when a termination or truncation is received at the current time step, the actual next state is provided in the "final_observation" entry of the infos dict, and we need to record this information. Here, done = termination or truncation.

On the other hand, when computing delta and the advantages with GAE (Generalized Advantage Estimation), delta should be masked by termination only (so a truncated step still bootstraps from the value of its final observation), while the advantage recursion should be masked by done. For more details, please refer to the code and comments in the PR (Pull Request) I have provided.
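A minimal sketch of that computation (not the PR's exact code), assuming per-step rollout buffers `rewards`, `values`, `terminations`, `dones` (termination OR truncation), and `next_values`, where `next_values[t]` holds the value of the true next state, i.e. the value of "final_observation" at a truncated or terminated step:

```python
import numpy as np

def gae(rewards, values, next_values, terminations, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with separate termination/done masks.

    All inputs are float arrays of shape (num_steps, num_envs).
    """
    advantages = np.zeros_like(rewards)
    last_gae = np.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):
        # Bootstrap only through non-terminal transitions: a truncated step
        # still bootstraps from V(final_observation) via next_values[t].
        delta = rewards[t] + gamma * next_values[t] * (1.0 - terminations[t]) - values[t]
        # Stop the advantage recursion at any episode boundary (termination OR
        # truncation), since the next sample belongs to a new episode.
        last_gae = delta + gamma * lam * (1.0 - dones[t]) * last_gae
        advantages[t] = last_gae
    return advantages, advantages + values
```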

vwxyzjn commented 5 months ago

Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Josh00-Lu commented 5 months ago

> Hi @Josh00-Lu thanks for the PR. Have you had cases where handling termination / truncation has made a huge difference in the performance?

Yes, especially for MuJoCo environments like Swimmer, where truncations are encountered far more often than terminations. When termination is predominant, the problem is less noticeable (though explicit handling does not hurt performance). I think this improvement is intuitive: by explicitly distinguishing termination from truncation, GAE can perform more reasonable temporal credit assignment in PPO. Admittedly, larger-scale experiments may be needed to verify this.

gmarkkula commented 5 months ago

+1 on seeing cases where it makes a big difference. CleanRL consistently performed much worse than SB3 on one of my environments, but when I added the bootstrapping from terminal observation to CleanRL it achieved the same performance as SB3. See this comment in another thread on this topic.

varadVaidya commented 5 months ago

+1 on massive improvements from the PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

Josh00-Lu commented 5 months ago

> +1 on massive improvements from the PR's changes. I have a custom quadrotor environment that works well with SB3, but the CleanRL implementation gave poor performance, including catastrophic forgetting of progress on the best seeds (the seeds that gave the best performance with SB3). With the PR's correct handling of truncation, I get matching performance w.r.t. SB3.

Thank you. It's great to see that!

sdpkjc commented 5 months ago

Thank you for this PR! I had previously noticed the issue and made a simple modification to the CleanRL PPO by referencing the fix done in https://github.com/DLR-RM/stable-baselines3/pull/658. Here is my code: https://gist.github.com/sdpkjc/71cbba3bd439d754ef9c026c31658ecd#file-ppo_continuous_action-py-L431-L437

It only requires adding a few lines to the existing code.
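For reference, a minimal sketch of that kind of patch (my paraphrase, not the gist's exact lines): the helper name `bootstrap_truncated_rewards` is made up, and `agent.get_value`, the `rewards` buffer, `step`, and `device` are assumed to be the usual variables from ppo_continuous_action.py. Called right after `envs.step(...)`, it folds gamma * V(final_observation) into the reward of any truncated sub-environment:

```python
import torch

def bootstrap_truncated_rewards(agent, rewards, step, infos, terminations, truncations, gamma, device):
    """Add gamma * V(final_observation) to the rewards of truncated sub-envs."""
    if "final_observation" not in infos:
        return
    for idx, truncated in enumerate(truncations):
        if truncated and not terminations[idx]:
            # The auto-resetting vector env exposes the real terminal observation here.
            real_next_obs = torch.tensor(
                infos["final_observation"][idx], dtype=torch.float32, device=device
            ).unsqueeze(0)
            with torch.no_grad():
                terminal_value = agent.get_value(real_next_obs).item()
            # Folding the bootstrapped value into the reward leaves the existing
            # GAE code downstream untouched.
            rewards[step, idx] += gamma * terminal_value
```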

I found that this modification offers limited performance improvements in the default Mujoco environments, but it yields benefits in scenarios that frequently involve truncation.

I also noticed that Costa @vwxyzjn seems to have been interested in this issue previously. In his blog post https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/, he mentioned this problem with PPO. However, he opted not to fix it in his implementation for the sake of fidelity to the original algorithm.


Whether to address this issue depends on whether CleanRL prioritizes higher performance or a more faithful implementation of the original algorithm; that is a topic worth discussing. As far as I know, SB3 has addressed this issue.

Perhaps we could consider adding an argument to allow users to choose whether to handle this correctly.

vwxyzjn commented 5 months ago

Thanks all for the discussion. I had opted not to implement the correct behavior also because, at some point, I did a comprehensive evaluation and it did not seem to matter. There are also issues with implementing it efficiently: the current Gymnasium design is quite inefficient in that it requires extra forward passes whenever there are terminal states.

https://github.com/sail-sg/envpool/issues/194#issuecomment-1317171243

pseudo-rnd-thoughts commented 4 months ago

Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't going to be changed; that would spare the people above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. That means the change suggested here will no longer cause the additional forward pass.

Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update it to Gymnasium v1.0.0 within this PR or afterwards.

Josh00-Lu commented 4 months ago

> Sorry for the late message. Personally, I would advocate for a separate file with this implementation if the original isn't going to be changed; that would spare the people above from having to make the change themselves. Second, @vwxyzjn, Gymnasium v1.0.0 updates the vector implementation to match EnvPool's autoreset style, i.e., there is no more final_observation, so most of CleanRL will need to be updated to reflect this change. That means the change suggested here will no longer cause the additional forward pass.
>
> Therefore, my suggestion would be to reopen this PR, include it in a new file (ppo_truncted.py?), and update it to Gymnasium v1.0.0 within this PR or afterwards.

I have reopened this PR and included the change in a new file called ppo_continuous_action_truncted.py, while keeping ppo_continuous_action.py unchanged. If the other files in this repository are updated to Gymnasium v1.0.0, I will update this file accordingly.