vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Add RPO to CleanRL #331

Closed masud99r closed 1 year ago

masud99r commented 1 year ago

Description

Types of changes

Checklist:

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

vercel[bot] commented 1 year ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Updated |
| --- | --- | --- | --- |
| cleanrl | ✅ Ready (Inspect) | Visit Preview | Jan 4, 2023 at 11:27 PM (UTC) |
vwxyzjn commented 1 year ago

Hey btw I noticed you ran into some issues with AttributeError: 'GLFWContext' object has no attribute '_context'. You just need to install some mujoco dependencies for rendering. https://github.com/vwxyzjn/cleanrl/blob/b558b2b48326d8bb8da7fa853914576ae4610f53/.github/workflows/tests.yaml#L154-L156

See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.
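A minimal sketch of the environment-variable route described in the dm_control rendering README (an assumption for illustration, not the contents of the linked tests.yaml lines): force a headless rendering backend before anything MuJoCo-related is imported.

```python
import os

# Assumed workaround sketch: pick a headless rendering backend explicitly.
# "egl" is GPU-accelerated headless rendering, "osmesa" is pure software;
# both avoid the default GLFW path, which fails without a display.
# This must happen before the first MuJoCo / dm_control import.
os.environ.setdefault("MUJOCO_GL", "egl")  # or "osmesa"
```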

masud99r commented 1 year ago

> Hey btw I noticed you ran into some issues with AttributeError: 'GLFWContext' object has no attribute '_context'. You just need to install some mujoco dependencies for rendering.
>
> https://github.com/vwxyzjn/cleanrl/blob/b558b2b48326d8bb8da7fa853914576ae4610f53/.github/workflows/tests.yaml#L154-L156
>
> See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.

I have tried these commands on my local machine, but the issue persists. I just wanted to give an update.

vwxyzjn commented 1 year ago

It might be that you are using hardware-accelerated rendering. Maybe look at https://github.com/deepmind/dm_control#rendering and https://github.com/deepmind/dm_control/issues/136.

masud99r commented 1 year ago

Hi @vwxyzjn , did you get around this issue? https://github.com/Farama-Foundation/Shimmy/issues/15#issue-1446206487 I am facing a similar issue with dm_control.

vwxyzjn commented 1 year ago

> Hi @vwxyzjn , did you get around this issue? Farama-Foundation/Shimmy#15 (comment) I am facing a similar issue with dm_control.

Actually yes, I can reproduce this error on my end. For now, try running the code without `--capture-video`.

masud99r commented 1 year ago

Here is the result comparison between PPO and RPO. These dm_control environments are the ones used in the RPO paper (https://arxiv.org/pdf/2212.07536.pdf).

Both PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

| Environment | ppo_continuous_action ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
| --- | --- | --- |
| dm_control/humanoid-stand-v0 | 21.41 ± 30.31 | 140.84 ± 57.21 |
| dm_control/humanoid-run-v0 | 5.47 ± 9.32 | 23.82 ± 19.57 |
| dm_control/hopper-hop-v0 | 6.33 ± 18.58 | 62.46 ± 88.61 |
| dm_control/quadruped-walk-v0 | 241.46 ± 92.15 | 610.52 ± 215.04 |
| dm_control/quadruped-run-v0 | 172.81 ± 40.87 | 370.70 ± 119.32 |
| dm_control/quadruped-escape-v0 | 22.78 ± 11.22 | 67.82 ± 28.01 |
| dm_control/walker-stand-v0 | 451.71 ± 220.04 | 724.61 ± 158.48 |
| dm_control/walker-run-v0 | 124.14 ± 88.88 | 389.88 ± 126.93 |
| dm_control/fish-swim-v0 | 83.03 ± 8.93 | 130.44 ± 51.66 |

Learning curves: ppo_continuous_action_slurm is ppo_continuous_action.py run for 8M timesteps and 10 random seeds.

(learning-curve figure: dm_control_ppo_rpo)

The rest of the dm_control experiments are running and will be added later.

vwxyzjn commented 1 year ago

This is perfect. Looking forward to seeing the rest of the results. A quick tip: instead of creating a separate ppo_continuous_action_slurm file, you could run `python ppo_continuous_action.py --exp-name ppo_continuous_action_8M --total-timesteps 8000000`.

Feel free to prepare the documentation in /docs/rl-algorithms/rpo.md. https://github.com/vwxyzjn/cleanrl/blob/master/docs/rl-algorithms/ppg.md might be a good example.

masud99r commented 1 year ago

Thanks for the tips. I was worried about messing up the existing experiments under the name ppo_continuous_action; using the arg --exp-name is a better option. Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so they do not overflow my workspace, where I keep the code and have a small storage budget. That is why I edited ppo_continuous_action.py into ppo_continuous_action_slurm.py. Is there a better way of doing that? Is there an arg I can use?

As for the final merge, I will only keep rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

vwxyzjn commented 1 year ago

> Thanks for the tips. I was worried about messing up the existing experiments under the name ppo_continuous_action.

This used to be a concern, but not anymore, since we now use a filter-based workflow.

> Additionally, I wanted to direct the experiment files (wandb, tensorboard, video, etc.) to separate storage so they do not overflow my workspace, where I keep the code and have a small storage budget. That is why I edited ppo_continuous_action.py into ppo_continuous_action_slurm.py.

--exp-name should direct files to folders prefixed with exp_name, so it has the same effect as creating a ppo_continuous_action_slurm.py. One other thing you could leverage: you can call the file from any directory, and it will save the tensorboard and video files in that directory. For example, `python cleanrl/ppo.py` vs `cd cleanrl && python ppo.py` will put the output files in different directories.

https://github.com/vwxyzjn/cleanrl/blob/2dd73af4a8f8c02de3d75f0f7f6e692dcf697273/cleanrl/ppo_continuous_action.py#L140-L153
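For context, here is a paraphrased sketch of what those linked lines do (simplified variable names, not the exact source):

```python
import time
from torch.utils.tensorboard import SummaryWriter

# Paraphrased sketch: exp_name is baked into the run name, and the run name into
# a *relative* output directory, which is why both --exp-name and the directory
# you launch the script from determine where tensorboard/video files land.
env_id, exp_name, seed = "dm_control/walker-run-v0", "ppo_continuous_action_8M", 1
run_name = f"{env_id}__{exp_name}__{seed}__{int(time.time())}"
writer = SummaryWriter(f"runs/{run_name}")  # resolved against the current working directory
```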

> As for the final merge, I will only keep rpo_continuous_action.py and remove the rest of the files I used to run the jobs, including ppo_continuous_action_slurm.py. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.

There is no need to delete those experiments, because the last benchmarked experiments are tagged with latest, which is what will be used for querying, so having the ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say that you should remove failed and crashed experiments, though.

masud99r commented 1 year ago

> There is no need to delete those experiments, because the last benchmarked experiments are tagged with latest, which is what will be used for querying, so having the ppo_continuous_action_slurm.py experiments would not affect any existing workflows. I would say that you should remove failed and crashed experiments, though.

I am thinking about the scope of this PR. I have created some additional files (run_rpo.sh, ppo_continuous_action_slurm.py, etc.) to facilitate running the experiments. My understanding is that this PR will add two files to vwxyzjn/cleanrl: rpo_continuous_action.py and rpo.sh for the benchmarking.

vwxyzjn commented 1 year ago

> My understanding is that this PR will add two files to vwxyzjn/cleanrl: rpo_continuous_action.py and rpo.sh for the benchmarking.

That is correct. To keep track of the ppo_continuous_action_slurm.py experiments, I would say update the ppo.sh to add lines such as:

```bash
OMP_NUM_THREADS=1 xvfb-run -a poetry run python -m cleanrl_utils.benchmark \
    --env-ids dm_control/acrobot-swingup-v0 dm_control/acrobot-swingup_sparse-v0 dm_control/ball_in_cup-catch-v0 dm_control/cartpole-balance-v0 dm_control/cartpole-balance_sparse-v0 dm_control/cartpole-swingup-v0 dm_control/cartpole-swingup_sparse-v0 dm_control/cartpole-two_poles-v0 dm_control/cartpole-three_poles-v0 dm_control/cheetah-run-v0 dm_control/dog-stand-v0 dm_control/dog-walk-v0 dm_control/dog-trot-v0 dm_control/dog-run-v0 dm_control/dog-fetch-v0 dm_control/finger-spin-v0 dm_control/finger-turn_easy-v0 dm_control/finger-turn_hard-v0 dm_control/fish-upright-v0 dm_control/fish-swim-v0 dm_control/hopper-stand-v0 dm_control/hopper-hop-v0 dm_control/humanoid-stand-v0 dm_control/humanoid-walk-v0 dm_control/humanoid-run-v0 dm_control/humanoid-run_pure_state-v0 dm_control/humanoid_CMU-stand-v0 dm_control/humanoid_CMU-run-v0 dm_control/lqr-lqr_2_1-v0 dm_control/lqr-lqr_6_2-v0 dm_control/manipulator-bring_ball-v0 dm_control/manipulator-bring_peg-v0 dm_control/manipulator-insert_ball-v0 dm_control/manipulator-insert_peg-v0 dm_control/pendulum-swingup-v0 dm_control/point_mass-easy-v0 dm_control/point_mass-hard-v0 dm_control/quadruped-walk-v0 dm_control/quadruped-run-v0 dm_control/quadruped-escape-v0 dm_control/quadruped-fetch-v0 dm_control/reacher-easy-v0 dm_control/reacher-hard-v0 dm_control/stacker-stack_2-v0 dm_control/stacker-stack_4-v0 dm_control/swimmer-swimmer6-v0 dm_control/swimmer-swimmer15-v0 dm_control/walker-stand-v0 dm_control/walker-walk-v0 dm_control/walker-run-v0 \
    --command "poetry run python cleanrl/ppo_continuous_action.py --exp-name ppo_continuous_action_slurm --cuda False --track" \
    --num-seeds 3 \
    --workers 9
```

I would, however, say that the name ppo_continuous_action_slurm is quite hacky... If it is not too much trouble, it may be preferable to re-run the experiments with the name ppo_continuous_action_8M, but it's not a big deal to me if you want to keep the current ones. We just need to add a short explanation of it in the docs.

masud99r commented 1 year ago

Re-running would not be a problem. However, I want to make sure I understand fully: the name in the experiments needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right? In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py, because I need to store the experiment data on a storage-only file system. I will edit the command in ppo.sh as you mentioned.

vwxyzjn commented 1 year ago

> Re-running would not be a problem. However, I want to make sure I understand fully: the name in the experiments needs to be ppo_continuous_action_8M instead of ppo_continuous_action_slurm. Is that right?

That’s right. The exp_name needs to be ppo_continuous_action_8M for filtering.

> In that case, for the benchmarking, it is easier for me to rename ppo_continuous_action_slurm.py to ppo_continuous_action_8M.py, because I need to store the experiment data on a storage-only file system.

Sounds good. Thanks.

masud99r commented 1 year ago

Results on all dm_control environments. Both PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.

| Environment | ppo_continuous_action_8M ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
| --- | --- | --- |
| dm_control/acrobot-swingup-v0 | 25.12 ± 7.77 | 41.62 ± 5.26 |
| dm_control/acrobot-swingup_sparse-v0 | 1.71 ± 0.74 | 3.25 ± 0.91 |
| dm_control/ball_in_cup-catch-v0 | 937.71 ± 8.68 | 941.16 ± 9.67 |
| dm_control/cartpole-balance-v0 | 791.93 ± 13.67 | 797.17 ± 9.63 |
| dm_control/cartpole-balance_sparse-v0 | 990.23 ± 7.34 | 989.03 ± 5.20 |
| dm_control/cartpole-swingup-v0 | 583.60 ± 13.82 | 615.99 ± 11.85 |
| dm_control/cartpole-swingup_sparse-v0 | 240.45 ± 299.08 | 526.18 ± 190.39 |
| dm_control/cartpole-two_poles-v0 | 217.71 ± 5.15 | 219.31 ± 6.76 |
| dm_control/cartpole-three_poles-v0 | 159.93 ± 2.12 | 160.10 ± 2.36 |
| dm_control/cheetah-run-v0 | 473.61 ± 107.87 | 562.38 ± 59.67 |
| dm_control/dog-stand-v0 | 332.40 ± 24.13 | 504.45 ± 135.58 |
| dm_control/dog-walk-v0 | 124.68 ± 23.05 | 166.50 ± 45.85 |
| dm_control/dog-trot-v0 | 80.08 ± 13.04 | 115.28 ± 30.15 |
| dm_control/dog-run-v0 | 67.93 ± 6.84 | 104.17 ± 24.32 |
| dm_control/dog-fetch-v0 | 28.73 ± 4.70 | 44.43 ± 7.21 |
| dm_control/finger-spin-v0 | 628.08 ± 253.43 | 849.49 ± 23.43 |
| dm_control/finger-turn_easy-v0 | 248.87 ± 80.39 | 447.46 ± 129.80 |
| dm_control/finger-turn_hard-v0 | 82.28 ± 34.94 | 238.67 ± 134.88 |
| dm_control/fish-upright-v0 | 541.82 ± 64.54 | 801.81 ± 35.79 |
| dm_control/fish-swim-v0 | 84.81 ± 6.90 | 139.70 ± 40.24 |
| dm_control/hopper-stand-v0 | 3.11 ± 1.63 | 408.24 ± 203.25 |
| dm_control/hopper-hop-v0 | 5.84 ± 17.28 | 63.69 ± 87.18 |
| dm_control/humanoid-stand-v0 | 21.13 ± 30.14 | 140.61 ± 58.36 |
| dm_control/humanoid-walk-v0 | 8.32 ± 17.00 | 77.51 ± 50.81 |
| dm_control/humanoid-run-v0 | 5.49 ± 9.20 | 24.16 ± 19.90 |
| dm_control/humanoid-run_pure_state-v0 | 1.10 ± 0.13 | 3.32 ± 2.50 |
| dm_control/humanoid_CMU-stand-v0 | 4.65 ± 0.30 | 4.34 ± 0.27 |
| dm_control/humanoid_CMU-run-v0 | 0.86 ± 0.08 | 0.85 ± 0.04 |
| dm_control/manipulator-bring_ball-v0 | 0.41 ± 0.22 | 0.55 ± 0.37 |
| dm_control/manipulator-bring_peg-v0 | 0.57 ± 0.24 | 2.64 ± 2.19 |
| dm_control/manipulator-insert_ball-v0 | 41.53 ± 15.58 | 39.24 ± 17.19 |
| dm_control/manipulator-insert_peg-v0 | 49.89 ± 14.26 | 53.23 ± 16.02 |
| dm_control/pendulum-swingup-v0 | 464.82 ± 379.09 | 776.21 ± 20.15 |
| dm_control/point_mass-easy-v0 | 530.17 ± 262.14 | 652.18 ± 21.93 |
| dm_control/point_mass-hard-v0 | 132.29 ± 69.71 | 184.29 ± 24.17 |
| dm_control/quadruped-walk-v0 | 239.11 ± 85.20 | 600.71 ± 213.40 |
| dm_control/quadruped-run-v0 | 165.70 ± 43.21 | 375.78 ± 119.95 |
| dm_control/quadruped-escape-v0 | 23.82 ± 10.82 | 68.19 ± 30.69 |
| dm_control/quadruped-fetch-v0 | 183.03 ± 37.50 | 227.47 ± 21.67 |
| dm_control/reacher-easy-v0 | 769.10 ± 48.35 | 740.93 ± 51.58 |
| dm_control/reacher-hard-v0 | 636.89 ± 77.68 | 584.33 ± 43.76 |
| dm_control/stacker-stack_2-v0 | 61.73 ± 17.34 | 64.32 ± 7.15 |
| dm_control/stacker-stack_4-v0 | 71.68 ± 29.03 | 53.78 ± 20.09 |
| dm_control/swimmer-swimmer6-v0 | 152.82 ± 30.77 | 166.89 ± 14.49 |
| dm_control/swimmer-swimmer15-v0 | 148.59 ± 22.63 | 153.45 ± 17.12 |
| dm_control/walker-stand-v0 | 453.73 ± 217.72 | 736.72 ± 145.16 |
| dm_control/walker-walk-v0 | 308.91 ± 87.85 | 785.61 ± 133.29 |
| dm_control/walker-run-v0 | 123.98 ± 91.34 | 384.08 ± 123.89 |

Learning curves: (figure: dm_control_all_ppo_rpo_8M)

vwxyzjn commented 1 year ago

The results look exceptionally clean 👍. Given the reported results, I would feel comfortable marking the existing ppo_continuous_action.py as "deprecated" or "not recommended" in favor of rpo_continuous_action.py, because of the simplicity of the approach and the superiority of the empirical results.

Please update the docs with these results. Please also remove the ppo_continuous_action_slurm.py results from the project, as they are no longer needed.

Thank you, and have a great holiday.

Howuhh commented 1 year ago

Seeing how much this improves the scores, I'm now really curious what would happen if we switched from the Gaussian to a Beta distribution... or simply clipped the log std, so that the std cannot shrink to an arbitrarily small value.

masud99r commented 1 year ago

@Howuhh Trying a Beta distribution would be an interesting idea. I observed some performance effects when trying Laplace and Gumbel distributions (https://arxiv.org/pdf/2212.07536.pdf, Figures 7 & 8).

One aspect of the RPO method is that it maintains randomness by perturbing the action mean with noise from a Uniform distribution. It would be interesting to see whether clipping the log std can achieve a similar effect.
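For anyone skimming the thread, here is a minimal PyTorch sketch of the two ideas being contrasted (a paraphrase, not the exact rpo_continuous_action.py code); `rpo_alpha` and the log-std floor are example values:

```python
import torch
from torch.distributions import Normal

def rpo_style_log_prob(action_mean, action_logstd, action, rpo_alpha=0.5):
    # RPO idea (paraphrased): during the policy update, perturb the Gaussian mean
    # with noise drawn from U(-rpo_alpha, rpo_alpha) so the policy keeps a source
    # of randomness instead of collapsing toward a near-deterministic mean.
    noise = torch.empty_like(action_mean).uniform_(-rpo_alpha, rpo_alpha)
    dist = Normal(action_mean + noise, action_logstd.exp())
    return dist.log_prob(action).sum(-1), dist.entropy().sum(-1)

def clipped_std_log_prob(action_mean, action_logstd, action, min_logstd=-2.0):
    # Howuhh's alternative: clamp log std from below so the std cannot shrink
    # past a floor (the floor value here is an arbitrary example).
    dist = Normal(action_mean, action_logstd.clamp(min=min_logstd).exp())
    return dist.log_prob(action).sum(-1), dist.entropy().sum(-1)
```

In a PPO-style update loop, one of these would be called in place of the plain Normal log-prob when re-evaluating the stored actions.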

masud99r commented 1 year ago

Hi @vwxyzjn, I have added a doc file named rpo.md. Please review it when you have some time.

masud99r commented 1 year ago

Resolved. Thank you for the suggestion.

poetry and BipedalWalker-v3

If we want to include the BipedalWalker-v3 experiments, we should also update the dependencies to include box2d-py for reproducibility purposes, but that seems to cause an issue for poetry (see Farama-Foundation/Gymnasium#86 (comment)). What we can do instead is to specify the installation in the Usage section of the docs, something like:

```bash
# mujoco v4 environments
poetry install --with mujoco
python cleanrl/rpo_continuous_action.py --help
python cleanrl/rpo_continuous_action.py --env-id Hopper-v2
# dm_control v4 environments
poetry install --with mujoco,dm_control
python cleanrl/rpo_continuous_action.py --env-id dm_control/cartpole-balance-v0
# BipedalWalker-v3 experiment (hack)
poetry install
poetry run pip install box2d-py==some-version
python cleanrl/rpo_continuous_action.py --env-id BipedalWalker-v3
```

masud99r commented 1 year ago

Help needed: the docs work as expected, except when I add iframes to incorporate the tracked experiments. When I add the tracked-experiments iframes, the docs render the first iframe's tracked results and then break. Do you have any idea what I am missing here? @vwxyzjn

RPO Doc: https://github.com/masud99r/cleanrl/blob/master/docs/rl-algorithms/rpo.md

masud99r commented 1 year ago

> Hey, I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Thanks for the fix. Were those tab/space issues or something different?

masud99r commented 1 year ago

I have added a warning section with the deprecation notice.

With that, I think I have addressed all the comments, but I might have missed some in the thread! Could you please take a look and let me know what else needs to be done for this PR?

masud99r commented 1 year ago

> LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.

It was my pleasure working on this. Thanks for your excellent guidance throughout the process. I have learned a lot.

I took a pass on the docs and fixed some typos. I am going to merge.