use alpha not log alpha in autotune

SobhanMP commented 10 months ago

Description

The lagrange multiplier in the entropy constraint (alpha, H(\pi) < H) needs to be positive for the relaxation to work. This means that alpha not log alpha should be used here. See the Softlearnign code at: https://github.com/rail-berkeley/softlearning/blob/13cf187cc93d90f7c217ea2845067491c3c65464/softlearning/algorithms/sac.py#L256

but the real question is: A) does this matter? B) am i making sense?

I'll answer A tomorrow, not sure about B tho.

Types of changes

[x] Bug fix
[ ] New feature
[ ] New algorithm
[ ] Documentation

Checklist:

[x] I've read the CONTRIBUTION guide (required).
[x] I have ensured pre-commit run --all-files passes (required).
[ ] I have updated the tests accordingly (if applicable).
[ ] I have updated the documentation and previewed the changes via mkdocs serve.
- [ ] I have explained note-worthy implementation details.
- [ ] I have explained the logged metrics.
- [ ] I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting changes:

[ ] I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
[ ] I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture-video.
[ ] I have performed RLops with python -m openrlbenchmark.rlops.
- For new feature or bug fix:
  - [ ] I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
- For new algorithm:
  - [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
- [ ] I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
- [ ] I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.

vercel[bot] commented 10 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
cleanrl	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Aug 30, 2023 0:07am

dosssman commented 10 months ago

@SobhanMP Nice catch. Thansk a lot for bringing this to our attention.

Indeed, both your code reference as well as the 2018 SAC paper suggest stipulate the usage of alpha instead of log_allpha.

Are you perhaps already testing your change ? Otherwise I can get some runs started on my side to check the effect of this change. Chances are, the results will not be that much different, but let's definitely stick as close as possible to the correct theory and implementation.

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there ? Or is it an artifact of the sac_continuous variant ? https://github.com/vwxyzjn/cleanrl/blob/f36d4a6426b5b08efdab4f65d7f696983cb04d4c/cleanrl/sac_atari.py#L314

Also noticed that stable-baselines3 seems to also have the same behavior. Attempt at pinging @arrafin https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

SobhanMP commented 10 months ago

@dosssman would be nice if you can test it, I'm having a big of a rough time installing clearnrl :sweat_smile:

dosssman commented 10 months ago

@SobhanMP Actually tried to install a clean environment since most of the ones I have are quite dated but I also get a lot of failures from poetry.

What kind of errors do you get when trying to install cleanrl now ?

SobhanMP commented 10 months ago

Poetry seems to shoot itself in the leg and remove tomli. Other than that, GL.h not found when i try to run it on the cluster. Ah, pyyaml doesn't work with cython 3 anymore, something about the license link being broken

SobhanMP commented 10 months ago

Installing poetry as root with a package manager seems to overcome a bunch of the issues

dosssman commented 10 months ago

I usually create a fresh conda env first, within which I then use poetry install. Most issues did not happen when using python=3.9 instead of `3.10.

Also, notes for solving mujoco_py related issues:

Mujoco install notes

GL/osmesa.h not found when building mujoco-py: sudo apt-get install libosmesa6-dev.
patchelf not found: sudo apt-get install patchelf ...

SobhanMP commented 10 months ago

sudo apt-get install libosmesa6-dev I figured that much, I don't have root on the cluster...

See the runs below: the left is from the open benchmark thing, right is with the fix. On another note, I think it would be a good idea to record alpha.

dosssman commented 10 months ago

alpha is also tracked by default, autotune or not.

Tracking benchmarks results in this Wandb report: https://api.wandb.ai/links/openrlbenchmark/owsly8ve Light teal for logalpha (master) variant, pink for the PR's fix.

EDIT: given that alpha values are very small with or without the fix, we likely won't see much difference. At least not on the three tasks we usually benchmark. Maybe a more complex task that benefits more from autotune would see some improvements though.

dosssman commented 10 months ago

@vwxyzjn Hello there. Any idea why the vercel test fails ? Thanks for your time.

timoklein commented 10 months ago

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there ? Or is it an artifact of the sac_continuous variant ?

https://github.com/vwxyzjn/cleanrl/blob/f36d4a6426b5b08efdab4f65d7f696983cb04d4c/cleanrl/sac_atari.py#L314

Also noticed that stable-baselines3 seems to also have the same behavior. Attempt at pinging @ARrafin https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

It's confusing for SAC discrete. The paper states a similar formula (using alpha) Screenshot from 2023-08-21 13-45-11

whereas the author's original code uses log_alpha. Can be seen here

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/135d3e2e06bbde2868047d738e3fc2d73fd8cc93/agents/actor_critic_agents/SAC.py#L184

According to the SAC authors this should be a non-issue though and most other major libraries use log_alpha as well, e.g. tianshou.

I'm honestly not quite sure what to do here. @SobhanMP plots show improved stability, so we should change it I guess?

timoklein commented 10 months ago

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there ? Or is it an artifact of the sac_continuous variant ?

https://github.com/vwxyzjn/cleanrl/blob/f36d4a6426b5b08efdab4f65d7f696983cb04d4c/cleanrl/sac_atari.py#L314

Also noticed that stable-baselines3 seems to also have the same behavior. Attempt at pinging @ARrafin https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

It's confusing for SAC discrete. The paper states a similar formula (using alpha) Screenshot from 2023-08-21 13-45-11

whereas the author's original code uses log_alpha. Can be seen here

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/135d3e2e06bbde2868047d738e3fc2d73fd8cc93/agents/actor_critic_agents/SAC.py#L184

According to the SAC authors this should be a non-issue though and most other major libraries use log_alpha as well, e.g. tianshou.

I'm honestly not quite sure what to do here. @SobhanMP plots show improved stability, so we should change it I guess?

SobhanMP commented 10 months ago

Actually, let me benchmark it again; I ran the original code w/o any change, tho it is suspicious that I have less variance.

dosssman commented 10 months ago

Benchmarks available here:https://api.wandb.ai/links/openrlbenchmark/eykdulk2. No significant difference with either log_alpha (light teal lines) or alpha (pink lines). Even the value of alpha coefficient itself are similar across variants.

Should be ready to merge.

SobhanMP commented 10 months ago

@dosssman thanks!

timoklein commented 10 months ago

I'm gonna run some experiments with SAC-discrete to verify that it works there as well. It's a good opportunity for me to get into the rlops/benchmarking stuff which I haven't used before.

After everything works we can merge it and also merge https://github.com/vwxyzjn/cleanrl/pull/383 while we're at it.

dosssman commented 10 months ago

After everything works we can merge it and also merge https://github.com/vwxyzjn/cleanrl/pull/383 while we're at it.

Great. I went ahead and tested the combination of that paper and with this one, yellow lines on top of teal and pink as specified in earlier post:

Also no significant regression in performance.

Any idea why the vercel related test fails here though ?

timoklein commented 10 months ago

0 clue about vercel. I'm not using it personally and haven't really interacted with it.

timoklein commented 10 months ago

Atari results take a while, and I can't run that many experiments in parallel due to RAM constraints for the replay buffer. Here are some results for SAC-discrete BeamRider and BreakOut.

BeamRider BreakOut

Pong results are still waiting, but I can already say:

Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

dosssman commented 10 months ago

Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

Did you manage to identify the specific reason why it is dying ? Since you mentioned RAM constraints, in my experience, I had some run die due to those. Might be useful to check the System / Allocated System Memory around the time the run died. It might not have anything to do with alpha.exp(), unless it explicitly throws an error regarding an invalid computation, or something along those lines (Wandb / Logs ?)

timoklein commented 10 months ago

Did you manage to identify the specific reason why it is dying?

By dying I mean the performance collapses and never recovers. Sorry for not being clear. I had one other run crash due to RAM constraints but I'm re-running it.

timoklein commented 10 months ago

Here's the results for Pong Pong

As I said this can probably be fixed by adjusting the target entropy scale which is very seed and environment-dependent.

timoklein commented 10 months ago

As I said this can probably be fixed by adjusting the target entropy scale which is very seed and environment-dependent.

To verify this claim, I ran the same seed that failed before (seed=1) with target_entropy_scale=0.8. Results are below:

seed_1_entropy_0_8

I'm gonna add the .exp() later as well. It is explicitly stated in the docs that

Tuning this parameter to the environment at hand is advised and can lead to significant performance gains.

so a non-performing seed isn't the end of the world IMO.

timoklein commented 10 months ago

Ready for merge from my side @dosssman

dosssman commented 10 months ago

@timoklein Thanks a lot. Looking good on my side too. I have also validated the actor_loss shape #383 (combined with this PR's changes too, no regression), so that one is also good to merge.

timoklein commented 10 months ago

Who wants the honor to merge it? Or should we wait for Costa?

vwxyzjn commented 10 months ago

Really great work everyone! Thanks so much for the PR and the benchmark effort. @dosssman or @timoklein Fee free to merge it!

vwxyzjn / cleanrl