Closed. masud99r closed this pull request 1 year ago.
Vercel deployment: cleanrl — ✅ Ready, Jan 4, 2023 at 11:27 PM (UTC).
Hey, btw, I noticed you ran into some issues with `AttributeError: 'GLFWContext' object has no attribute '_context'`. You just need to install some mujoco dependencies for rendering: https://github.com/vwxyzjn/cleanrl/blob/b558b2b48326d8bb8da7fa853914576ae4610f53/.github/workflows/tests.yaml#L154-L156

See https://github.com/deepmind/dm_control#rendering. I added a bit of docs to explain this issue.
I have tried these commands on my local machine, but the issue persists. I just wanted to give an update.
It might be the case you are using hardware-accelerated rendering. Maybe look up https://github.com/deepmind/dm_control#rendering and https://github.com/deepmind/dm_control/issues/136
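For readers hitting the same error: as the dm_control README linked above explains, the OpenGL backend is selected via the `MUJOCO_GL` environment variable, and on a headless machine software rendering is the usual workaround. A rough sketch (package name assumes Debian/Ubuntu; check the dm_control docs for your setup):

```shell
# Headless/software rendering for dm_control (sketch).
# First install the software renderer, e.g. on Debian/Ubuntu:
#   sudo apt-get install -y libosmesa6-dev
# Then pick the backend before launching training; osmesa renders on the
# CPU and sidesteps GLFWContext/EGL errors on machines without a display.
export MUJOCO_GL=osmesa   # alternatives: egl (headless GPU), glfw (desktop with X)
echo "MUJOCO_GL=$MUJOCO_GL"
```

With a GPU available, `MUJOCO_GL=egl` is typically much faster than osmesa.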
Hi @vwxyzjn, did you get around this issue? https://github.com/Farama-Foundation/Shimmy/issues/15#issue-1446206487 I am facing a similar issue with dm_control.
Actually yes, I can reproduce this error on my end. For now, try running the code without `--capture-video`.
Here is a result comparison between PPO and RPO on the dm_control environments used in the RPO paper (https://arxiv.org/pdf/2212.07536.pdf). Both PPO and RPO were run for 8M timesteps, and results are computed over 10 random seeds.
Environment | ppo_continuous_action ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) |
---|---|---|
dm_control/humanoid-stand-v0 | 21.41 ± 30.31 | 140.84 ± 57.21 |
dm_control/humanoid-run-v0 | 5.47 ± 9.32 | 23.82 ± 19.57 |
dm_control/hopper-hop-v0 | 6.33 ± 18.58 | 62.46 ± 88.61 |
dm_control/quadruped-walk-v0 | 241.46 ± 92.15 | 610.52 ± 215.04 |
dm_control/quadruped-run-v0 | 172.81 ± 40.87 | 370.70 ± 119.32 |
dm_control/quadruped-escape-v0 | 22.78 ± 11.22 | 67.82 ± 28.01 |
dm_control/walker-stand-v0 | 451.71 ± 220.04 | 724.61 ± 158.48 |
dm_control/walker-run-v0 | 124.14 ± 88.88 | 389.88 ± 126.93 |
dm_control/fish-swim-v0 | 83.03 ± 8.93 | 130.44 ± 51.66 |
Learning curve:
`ppo_continuous_action_slurm` is `ppo_continuous_action.py` run for 8M timesteps and 10 random seeds.
The rest of the dm_control experiments are running and will be added later.
This is perfect. Looking forward to seeing the rest of the results. A quick tip: instead of creating a separate `ppo_continuous_action_slurm` file, you could do `python ppo_continuous_action.py --exp-name ppo_continuous_action_8M --total-timesteps 8000000`.

Feel free to prepare the documentation in `/docs/rl-algorithms/rpo.md`. https://github.com/vwxyzjn/cleanrl/blob/master/docs/rl-algorithms/ppg.md might be a good example.
Thanks for the tips. I was worried about messing up the existing experiments with the name `ppo_continuous_action`. Using the `--exp-name` arg is a better option.

Additionally, I wanted to direct the experiment's files (wandb, tensorboard, video, etc.) to separate storage so they don't overflow my workspace, which has a small storage budget. Thus I edited `ppo_continuous_action.py` to make `ppo_continuous_action_slurm.py`. Is there a better way of doing that? Is there an arg I can use?

As for the final merge, I will only keep `rpo_continuous_action.py` and remove the rest of the files I used to run the jobs, including `ppo_continuous_action_slurm.py`. The code will still be available in the wandb experiment files (openrlbenchmark). Is that fine? Or let me know if there is a better way.
> Thanks for the tips. I was worried about messing up the existing experiments with the name `ppo_continuous_action`.

This used to be a concern, but not anymore, since we are now using a filter-based workflow.

> Additionally, I wanted to direct the experiment's files (wandb, tensorboard, video, etc.) to separate storage.

`--exp-name` should direct files to folders prefixed with `exp_name`, so it has the same effect as creating a `ppo_continuous_action_slurm.py`. One other thing you could leverage: you can call the file from any directory, and it will save the tensorboard and video files in that directory. For example, `python cleanrl/ppo.py` vs. `cd cleanrl && python ppo.py` will put the storage files in different directories.
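To make the relative-path point concrete, here is a small sketch of the run-naming convention described above (illustrative only, not the exact cleanrl code): the run name is prefixed with `exp_name`, and the `runs/` folder is resolved against the current working directory.

```python
import os
import time

def run_dir(env_id: str, exp_name: str, seed: int) -> str:
    """Sketch of how the output folder is derived: the run name starts
    with env_id and exp_name, and "runs/" is resolved relative to the
    current working directory -- which is why launching the script from
    a different directory lands the tensorboard/video files elsewhere."""
    run_name = f"{env_id}__{exp_name}__{seed}__{int(time.time())}"
    return os.path.join(os.getcwd(), "runs", run_name)

# Launching from e.g. /scratch would store artifacts under /scratch/runs/...,
# keeping them out of a small code workspace.
```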
> As of the final merging, I will only keep the `rpo_continuous_action.py` and remove the rest of the files I used to run the jobs, including `ppo_continuous_action_slurm.py`. The code will still be available in the wandb experiments file (openrlbenchmark). Is that fine? Or let me know if there is a better way.
There is no need to delete those experiments, because the last benchmarked experiments are tagged with `latest`, which is what will be used for querying; having these `ppo_continuous_action_slurm.py` experiments would not affect any existing workflows. I would say you should remove the `failed` and `crashed` experiments, though.
I am thinking about this PR. I have created some additional files (`run_rpo.sh`, `ppo_continuous_action_slurm.py`, etc.) to facilitate running the experiments. My understanding is that this PR will add two files to `vwxyzjn/cleanrl`: `rpo_continuous_action.py`, and `rpo.sh` for the benchmarking.
> My understanding is that this PR will add two files to the vwxyzjn/cleanrl: rpo_continuous_action.py, and rpo.sh for the benchmarking.
That is correct. To keep track of the `ppo_continuous_action_slurm.py` experiments, I would say update `ppo.sh` to add lines such as
```shell
OMP_NUM_THREADS=1 xvfb-run -a poetry run python -m cleanrl_utils.benchmark \
    --env-ids dm_control/acrobot-swingup-v0 dm_control/acrobot-swingup_sparse-v0 dm_control/ball_in_cup-catch-v0 dm_control/cartpole-balance-v0 dm_control/cartpole-balance_sparse-v0 dm_control/cartpole-swingup-v0 dm_control/cartpole-swingup_sparse-v0 dm_control/cartpole-two_poles-v0 dm_control/cartpole-three_poles-v0 dm_control/cheetah-run-v0 dm_control/dog-stand-v0 dm_control/dog-walk-v0 dm_control/dog-trot-v0 dm_control/dog-run-v0 dm_control/dog-fetch-v0 dm_control/finger-spin-v0 dm_control/finger-turn_easy-v0 dm_control/finger-turn_hard-v0 dm_control/fish-upright-v0 dm_control/fish-swim-v0 dm_control/hopper-stand-v0 dm_control/hopper-hop-v0 dm_control/humanoid-stand-v0 dm_control/humanoid-walk-v0 dm_control/humanoid-run-v0 dm_control/humanoid-run_pure_state-v0 dm_control/humanoid_CMU-stand-v0 dm_control/humanoid_CMU-run-v0 dm_control/lqr-lqr_2_1-v0 dm_control/lqr-lqr_6_2-v0 dm_control/manipulator-bring_ball-v0 dm_control/manipulator-bring_peg-v0 dm_control/manipulator-insert_ball-v0 dm_control/manipulator-insert_peg-v0 dm_control/pendulum-swingup-v0 dm_control/point_mass-easy-v0 dm_control/point_mass-hard-v0 dm_control/quadruped-walk-v0 dm_control/quadruped-run-v0 dm_control/quadruped-escape-v0 dm_control/quadruped-fetch-v0 dm_control/reacher-easy-v0 dm_control/reacher-hard-v0 dm_control/stacker-stack_2-v0 dm_control/stacker-stack_4-v0 dm_control/swimmer-swimmer6-v0 dm_control/swimmer-swimmer15-v0 dm_control/walker-stand-v0 dm_control/walker-walk-v0 dm_control/walker-run-v0 \
    --command "poetry run python cleanrl/ppo_continuous_action.py --exp-name ppo_continuous_action_slurm --cuda False --track" \
    --num-seeds 3 \
    --workers 9
```
I would, however, say that the name `ppo_continuous_action_slurm` is quite hacky... If it is not too much trouble, it may be preferable to re-run the experiments with the name `ppo_continuous_action_8M`, but it's not a big deal to me if you want to keep the current ones. We would just need to add a short explanation in the docs.
Re-running would not be a problem. However, I want to make sure I understand fully: the name in the experiments needs to be `ppo_continuous_action_8M` instead of `ppo_continuous_action_slurm`. Is that right?

In that case, for the benchmarking, it is easier for me to rename `ppo_continuous_action_slurm.py` to `ppo_continuous_action_8M.py`, because I need to store the experiment data in a storage-only file system. I will edit the command in `ppo.sh` as you mentioned.
> Re-running would not be a problem. However, I want to make sure I understand it fully. The name in the experiments needs to be `ppo_continuous_action_8M` instead of `ppo_continuous_action_slurm`. Is that right?
That's right. The `exp_name` needs to be `ppo_continuous_action_8M` for filtering.
> In that case, for the benchmarking, it is easier for me to rename `ppo_continuous_action_slurm.py` to `ppo_continuous_action_8M.py`. This is because I need to store the experiment data in a storage-only file system.
Sounds good. Thanks.
Results on all dm_control environments. The PPO and RPO run for 8M timesteps, and results are computed over 10 random seeds.
Environment | ppo_continuous_action_8M ({'tag': ['v1.0.0-13-gcbd83f6']}) | rpo_continuous_action ({'tag': ['pr-331']}) | |
---|---|---|---|
dm_control/acrobot-swingup-v0 | 25.12 ± 7.77 | 41.62 ± 5.26 | |
dm_control/acrobot-swingup_sparse-v0 | 1.71 ± 0.74 | 3.25 ± 0.91 | |
dm_control/ball_in_cup-catch-v0 | 937.71 ± 8.68 | 941.16 ± 9.67 | |
dm_control/cartpole-balance-v0 | 791.93 ± 13.67 | 797.17 ± 9.63 | |
dm_control/cartpole-balance_sparse-v0 | 990.23 ± 7.34 | 989.03 ± 5.20 | |
dm_control/cartpole-swingup-v0 | 583.60 ± 13.82 | 615.99 ± 11.85 | |
dm_control/cartpole-swingup_sparse-v0 | 240.45 ± 299.08 | 526.18 ± 190.39 | |
dm_control/cartpole-two_poles-v0 | 217.71 ± 5.15 | 219.31 ± 6.76 | |
dm_control/cartpole-three_poles-v0 | 159.93 ± 2.12 | 160.10 ± 2.36 | |
dm_control/cheetah-run-v0 | 473.61 ± 107.87 | 562.38 ± 59.67 | |
dm_control/dog-stand-v0 | 332.40 ± 24.13 | 504.45 ± 135.58 | |
dm_control/dog-walk-v0 | 124.68 ± 23.05 | 166.50 ± 45.85 | |
dm_control/dog-trot-v0 | 80.08 ± 13.04 | 115.28 ± 30.15 | |
dm_control/dog-run-v0 | 67.93 ± 6.84 | 104.17 ± 24.32 | |
dm_control/dog-fetch-v0 | 28.73 ± 4.70 | 44.43 ± 7.21 | |
dm_control/finger-spin-v0 | 628.08 ± 253.43 | 849.49 ± 23.43 | |
dm_control/finger-turn_easy-v0 | 248.87 ± 80.39 | 447.46 ± 129.80 | |
dm_control/finger-turn_hard-v0 | 82.28 ± 34.94 | 238.67 ± 134.88 | |
dm_control/fish-upright-v0 | 541.82 ± 64.54 | 801.81 ± 35.79 | |
dm_control/fish-swim-v0 | 84.81 ± 6.90 | 139.70 ± 40.24 | |
dm_control/hopper-stand-v0 | 3.11 ± 1.63 | 408.24 ± 203.25 | |
dm_control/hopper-hop-v0 | 5.84 ± 17.28 | 63.69 ± 87.18 | |
dm_control/humanoid-stand-v0 | 21.13 ± 30.14 | 140.61 ± 58.36 | |
dm_control/humanoid-walk-v0 | 8.32 ± 17.00 | 77.51 ± 50.81 | |
dm_control/humanoid-run-v0 | 5.49 ± 9.20 | 24.16 ± 19.90 | |
dm_control/humanoid-run_pure_state-v0 | 1.10 ± 0.13 | 3.32 ± 2.50 | |
dm_control/humanoid_CMU-stand-v0 | 4.65 ± 0.30 | 4.34 ± 0.27 | |
dm_control/humanoid_CMU-run-v0 | 0.86 ± 0.08 | 0.85 ± 0.04 | |
dm_control/manipulator-bring_ball-v0 | 0.41 ± 0.22 | 0.55 ± 0.37 | |
dm_control/manipulator-bring_peg-v0 | 0.57 ± 0.24 | 2.64 ± 2.19 | |
dm_control/manipulator-insert_ball-v0 | 41.53 ± 15.58 | 39.24 ± 17.19 | |
dm_control/manipulator-insert_peg-v0 | 49.89 ± 14.26 | 53.23 ± 16.02 | |
dm_control/pendulum-swingup-v0 | 464.82 ± 379.09 | 776.21 ± 20.15 | |
dm_control/point_mass-easy-v0 | 530.17 ± 262.14 | 652.18 ± 21.93 | |
dm_control/point_mass-hard-v0 | 132.29 ± 69.71 | 184.29 ± 24.17 | |
dm_control/quadruped-walk-v0 | 239.11 ± 85.20 | 600.71 ± 213.40 | |
dm_control/quadruped-run-v0 | 165.70 ± 43.21 | 375.78 ± 119.95 | |
dm_control/quadruped-escape-v0 | 23.82 ± 10.82 | 68.19 ± 30.69 | |
dm_control/quadruped-fetch-v0 | 183.03 ± 37.50 | 227.47 ± 21.67 | |
dm_control/reacher-easy-v0 | 769.10 ± 48.35 | 740.93 ± 51.58 | |
dm_control/reacher-hard-v0 | 636.89 ± 77.68 | 584.33 ± 43.76 | |
dm_control/stacker-stack_2-v0 | 61.73 ± 17.34 | 64.32 ± 7.15 | |
dm_control/stacker-stack_4-v0 | 71.68 ± 29.03 | 53.78 ± 20.09 | |
dm_control/swimmer-swimmer6-v0 | 152.82 ± 30.77 | 166.89 ± 14.49 | |
dm_control/swimmer-swimmer15-v0 | 148.59 ± 22.63 | 153.45 ± 17.12 | |
dm_control/walker-stand-v0 | 453.73 ± 217.72 | 736.72 ± 145.16 | |
dm_control/walker-walk-v0 | 308.91 ± 87.85 | 785.61 ± 133.29 | |
dm_control/walker-run-v0 | 123.98 ± 91.34 | 384.08 ± 123.89 |
Learning curves:
The results look exceptionally clean 👍. Given all the reported results, I would feel comfortable marking the existing `ppo_continuous_action.py` as "deprecated" or "not recommended" in favor of `rpo_continuous_action.py`, because of the simplicity of the approach and the superiority of the empirical results.

Please update the docs with these results. Please also remove the `ppo_continuous_action_slurm.py` results from the project, as they are no longer needed.
Thank you, and have a great holiday.
Seeing how much this improves the scores, I'm now really curious what would happen if we changed from a Gaussian to a Beta distribution... or simply clipped the log std, so that the std cannot shrink to too small a value.
@Howuhh Trying a Beta distribution would be an interesting idea. I observed some performance effects when trying Laplace and Gumbel distributions (https://arxiv.org/pdf/2212.07536.pdf, Figures 7 & 8).

A key aspect of the RPO method is maintaining randomness through a Uniform distribution. It would be interesting to see if clipping the log std can achieve a similar effect.
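For readers skimming the thread, the uniform-perturbation idea described above can be sketched in a few lines. This is an illustrative numpy version of the trick, not the PR's actual torch implementation: before computing log-probabilities in the policy update, the Gaussian mean is shifted by Uniform(-alpha, alpha) noise, so the effective policy keeps a floor of randomness.

```python
import numpy as np

def rpo_perturbed_mean(action_mean, alpha=0.5, rng=None):
    """Sketch of the RPO trick: shift the Gaussian policy mean by
    z ~ Uniform(-alpha, alpha) during the policy update, preventing
    the policy from collapsing toward a near-deterministic Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.uniform(-alpha, alpha, size=np.shape(action_mean))
    return np.asarray(action_mean) + z
```

In the actual agent, the perturbed mean parameterizes the Normal distribution used for log-prob and entropy computation; `alpha` is the method's single extra hyperparameter.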
Hi @vwxyzjn, I have added a doc file named `rpo.md`. Please review it when you have some time.
Resolved. Thank you for the suggestion.
poetry and `BipedalWalker-v3`

If we want to include the `BipedalWalker-v3` experiments, we should also update the dependencies to include box2d-py for reproducibility purposes, but that seems to cause an issue for poetry (see Farama-Foundation/Gymnasium#86 (comment)). What we can do instead is to specify the installation in the Usage section of the docs. Something like:

```shell
# mujoco v4 environments
poetry install --with mujoco
python cleanrl/rpo_continuous_action.py --help
python cleanrl/rpo_continuous_action.py --env-id Hopper-v2

# dm_control v4 environments
poetry install --with mujoco,dm_control
python cleanrl/rpo_continuous_action.py --env-id dm_control/cartpole-balance-v0

# BipedalWalker-v3 experiment (hack)
poetry install
poetry run pip install box2d-py==some-version
python cleanrl/rpo_continuous_action.py --env-id BipedalWalker-v3
```
Help needed: the docs work as expected, except when I add an `iframe` to incorporate tracked experiments. When I add the tracked-experiments `iframe`, the docs show the first iframe's tracked results and break from there. Do you have any idea what I am missing here? @vwxyzjn
RPO Doc: https://github.com/masud99r/cleanrl/blob/master/docs/rl-algorithms/rpo.md
Hey, I fixed some formatting issues with the iframe. Please let me know if you run into additional issues.

Thanks for the fix. Were those tab/space issues or something different?
I have added a warning section with the deprecation notice.
With that, I think I have addressed all the comments, but I might have missed some in the thread! Could you please take a look and let me know what else needs to be done for this PR?
LGTM. Thanks for your patience and this awesome contribution. Please give it a final read at https://cleanrl-git-fork-masud99r-master-vwxyzjn.vercel.app/rl-algorithms/rpo/. I have added you as a collaborator of this repo, and feel free to go ahead and merge.
It was my pleasure working on this. Thanks for your excellent guidance throughout the process. I have learned a lot.
I took a pass on the docs and fixed some typos. I am going to merge.
Description
Types of changes
Checklist:

- `pre-commit run --all-files` passes (required).
- I have updated the documentation and previewed the changes via `mkdocs serve`.

If you are adding new algorithm variants or your change could result in a performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

- I have run tracked experiments with the `--capture-video` flag toggled on (required).
- I have added documentation and previewed the changes via `mkdocs serve`.