vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Gymnasium support for DDPG continuous (+Jax) #371

Closed arjun-kg closed 1 year ago

arjun-kg commented 1 year ago

Description

Port ddpg_continuous_action.py and ddpg_continuous_action_jax.py to gymnasium.
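The main interface changes involved in such a port are gymnasium's new `reset`/`step` signatures: `reset()` returns `(obs, info)`, and `step()` returns a 5-tuple where the old `done` flag is split into `terminated` and `truncated`. A minimal sketch of the change (using a hypothetical `DummyEnv` stand-in rather than a real environment):

```python
# Sketch of the core API change when porting from gym to gymnasium.
# DummyEnv is a hypothetical stand-in mimicking the gymnasium interface.

class DummyEnv:
    """Minimal stand-in for a gymnasium environment."""

    def reset(self, seed=None):
        # gymnasium: reset() returns (obs, info) instead of just obs
        return 0.0, {}

    def step(self, action):
        # gymnasium: step() returns a 5-tuple; the old `done` flag is
        # split into `terminated` (MDP end) and `truncated` (time limit)
        obs, reward = 0.0, 1.0
        terminated, truncated = False, True
        return obs, reward, terminated, truncated, {}

env = DummyEnv()
obs, info = env.reset(seed=1)
obs, reward, terminated, truncated, info = env.step(0)
done = terminated or truncated  # old-style `done` for replay-buffer bookkeeping
```

The `terminated`/`truncated` distinction matters for off-policy algorithms like DDPG: bootstrapping the value target should only be cut off on true termination, not on a time-limit truncation.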

Types of changes

Checklist:

If you need to run benchmark experiments for a performance-impacting change:

Rlops report

python -m openrlbenchmark.rlops \
    --filters '?we=openrlbenchmark&wpn=cleanrl&ceik=env_id&cen=exp_name&metric=charts/episodic_return' \
        'ddpg_continuous_action?tag=pr-371' \
        'ddpg_continuous_action_jax?tag=pr-371-jax' \
    --env-ids Hopper-v2 Walker2d-v2 HalfCheetah-v2 \
    --check-empty-runs False \
    --ncols 3 \
    --ncols-legend 2 \
    --output-filename figures/0compare \
    --scan-history \
    --report
────────────────────────────────────────────────────────────────────────────────────── Runtime (m) (mean ± std) ──────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Environment    ┃ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']}) ┃ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hopper-v2      │ 82.48884665340242                                                    │ 97.04908408278409                                                            │
│ Walker2d-v2    │ 83.70214285646155                                                    │ 99.79698188415784                                                            │
│ HalfCheetah-v2 │ 84.70859018747274                                                    │ 99.89238566430278                                                            │
└────────────────┴──────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
──────────────────────────────────────────────────────────────────────────────────── Episodic Return (mean ± std) ────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Environment    ┃ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']}) ┃ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hopper-v2      │ 1182.86 ± 58.52                                                      │ 1523.78 ± 201.77                                                             │
│ Walker2d-v2    │ 1174.04 ± 2.72                                                       │ 1254.34 ± 135.92                                                             │
│ HalfCheetah-v2 │ 10073.02 ± 615.81                                                    │ 10249.45 ± 373.49                                                            │
└────────────────┴──────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
──────────────────────────────────────────────────────────────────────────────────────── Runtime (m) Average ─────────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Environment                                                                  ┃ Average Runtime   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']})         │ 83.63319323244558 │
│ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) │ 98.9128172104149  │
└──────────────────────────────────────────────────────────────────────────────┴───────────────────┘

https://wandb.ai/costa-huang/cleanrl/reports/Regression-Report-ddpg_continuous_action_jax--Vmlldzo0MjUwNDAx

arjun-kg commented 1 year ago

Feel free to start the RLops process.

https://wandb.ai/openrlbenchmark/cleanrl/reports/Regression-Report-ddpg_continuous_action--VmlldzozOTk4NzY1

This is for DDPG continuous. There seem to be somewhat significant differences, but I'm not sure how to interpret them. I used gymnasium 0.28.1, numpy 1.24 (I later noticed poetry downgrading it to 1.21, so that might matter; 1.21 gave some errors, which is why I had tried 1.24), and SB3 alpha 1. Let me know what you think. I can re-run if needed.

vwxyzjn commented 1 year ago

@arjun-kg I think the report looks great. DDPG is definitely more unstable, so the results are expected. Feel free to update the docs and we can merge.

arjun-kg commented 1 year ago

@vwxyzjn That's great! Just started the runs for ddpg-jax, will update results of that as well soon. Do I need to update the results of the ddpg_continuous run / RLOps process anywhere?

arjun-kg commented 1 year ago

@vwxyzjn The results of RLOps for DDPG-Jax - https://wandb.ai/openrlbenchmark/cleanrl/reports/Regression-Report-ddpg_continuous_action_jax--Vmlldzo0MDE2NzA2

vwxyzjn commented 1 year ago

Looks great!

vwxyzjn commented 1 year ago

No sign of regression as shown in the PR description. Merging now.