vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Gymnasium support for DDPG continuous (+Jax) #371

Closed arjun-kg closed 1 year ago

arjun-kg commented 1 year ago

Description

Port ddpg_continuous_action.py and ddpg_continuous_action_jax.py to gymnasium.
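The main interface changes involved in such a port are gymnasium's new `reset`/`step` signatures: `reset()` returns `(obs, info)`, and `step()` returns a 5-tuple where the old `done` flag is split into `terminated` and `truncated`. A minimal sketch of the change (using a hypothetical `DummyEnv` stand-in rather than a real environment):

```python
# Sketch of the core API change when porting from gym to gymnasium.
# DummyEnv is a hypothetical stand-in mimicking the gymnasium interface.

class DummyEnv:
    """Minimal stand-in for a gymnasium environment."""

    def reset(self, seed=None):
        # gymnasium: reset() returns (obs, info) instead of just obs
        return 0.0, {}

    def step(self, action):
        # gymnasium: step() returns a 5-tuple; the old `done` flag is
        # split into `terminated` (MDP end) and `truncated` (time limit)
        obs, reward = 0.0, 1.0
        terminated, truncated = False, True
        return obs, reward, terminated, truncated, {}

env = DummyEnv()
obs, info = env.reset(seed=1)
obs, reward, terminated, truncated, info = env.step(0)
done = terminated or truncated  # old-style `done` for replay-buffer bookkeeping
```

The `terminated`/`truncated` distinction matters for off-policy algorithms like DDPG: bootstrapping the value target should only be cut off on true termination, not on a time-limit truncation.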

Types of changes

Checklist:

If you need to run benchmark experiments for a performance-impacting change:

Rlops report

python -m openrlbenchmark.rlops \
    --filters '?we=openrlbenchmark&wpn=cleanrl&ceik=env_id&cen=exp_name&metric=charts/episodic_return' \
        'ddpg_continuous_action?tag=pr-371' \
        'ddpg_continuous_action_jax?tag=pr-371-jax' \
    --env-ids Hopper-v2 Walker2d-v2 HalfCheetah-v2 \
    --check-empty-runs False \
    --ncols 3 \
    --ncols-legend 2 \
    --output-filename figures/0compare \
    --scan-history \
    --report
────────────────────────────────────────────────────────────────────────────────────── Runtime (m) (mean ± std) ──────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Environment    ┃ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']}) ┃ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hopper-v2      │ 82.48884665340242                                                    │ 97.04908408278409                                                            │
│ Walker2d-v2    │ 83.70214285646155                                                    │ 99.79698188415784                                                            │
│ HalfCheetah-v2 │ 84.70859018747274                                                    │ 99.89238566430278                                                            │
└────────────────┴──────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
──────────────────────────────────────────────────────────────────────────────────── Episodic Return (mean ± std) ────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Environment    ┃ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']}) ┃ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hopper-v2      │ 1182.86 ± 58.52                                                      │ 1523.78 ± 201.77                                                             │
│ Walker2d-v2    │ 1174.04 ± 2.72                                                       │ 1254.34 ± 135.92                                                             │
│ HalfCheetah-v2 │ 10073.02 ± 615.81                                                    │ 10249.45 ± 373.49                                                            │
└────────────────┴──────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┘
──────────────────────────────────────────────────────────────────────────────────────── Runtime (m) Average ─────────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Environment                                                                  ┃ Average Runtime   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ openrlbenchmark/cleanrl/ddpg_continuous_action ({'tag': ['pr-371']})         │ 83.63319323244558 │
│ openrlbenchmark/cleanrl/ddpg_continuous_action_jax ({'tag': ['pr-371-jax']}) │ 98.9128172104149  │
└──────────────────────────────────────────────────────────────────────────────┴───────────────────┘

https://wandb.ai/costa-huang/cleanrl/reports/Regression-Report-ddpg_continuous_action_jax--Vmlldzo0MjUwNDAx

arjun-kg commented 1 year ago

Feel free to start the RLops process.

https://wandb.ai/openrlbenchmark/cleanrl/reports/Regression-Report-ddpg_continuous_action--VmlldzozOTk4NzY1

This is for DDPG continuous. There seem to be somewhat significant differences, but I'm not sure how to interpret them. I used gymnasium 0.28.1, numpy 1.24 (I later noticed poetry downgrading it to 1.21, so that might matter; 1.21 gave some errors, which is why I had tried 1.24), and SB3 alpha 1. Let me know what you think. I can re-run if needed.

vwxyzjn commented 1 year ago

@arjun-kg I think the report looks great. DDPG is definitely more unstable, so the results are expected. Feel free to update the docs and we can merge.

arjun-kg commented 1 year ago

@vwxyzjn That's great! Just started the runs for ddpg-jax, will update results of that as well soon. Do I need to update the results of the ddpg_continuous run / RLOps process anywhere?

arjun-kg commented 1 year ago

@vwxyzjn The results of RLOps for DDPG-Jax - https://wandb.ai/openrlbenchmark/cleanrl/reports/Regression-Report-ddpg_continuous_action_jax--Vmlldzo0MDE2NzA2

vwxyzjn commented 1 year ago

Looks great!

vwxyzjn commented 1 year ago

No sign of regression as shown in the PR description. Merging now.