SAC cannot converge to optimal policy

mahaozhe commented 11 months ago

Problem Description

When I run experiments on the "MountainCarContinuous-v0" environment, sac_continuous_action.py does not converge to the optimal policy. Compared with rpo_continuous_action.py, SAC stays stuck at a local optimum and its returns never increase:

The learning records from RPO:

global_step=48553, episodic_return=[91.50219]
global_step=48709, episodic_return=[91.06741]
global_step=48859, episodic_return=[91.755295]
global_step=48981, episodic_return=[93.888855]

The learning records from SAC:

global_step=73925, episodic_return=-0.7892028093338013
global_step=74924, episodic_return=-0.7795553207397461
global_step=75923, episodic_return=-0.7974969744682312
global_step=76922, episodic_return=-0.8009135127067566

The logs show SAC converging to returns slightly below 0, while RPO reaches returns above 90 (close to the optimal policy) much faster.
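To make the comparison above easier to repeat, here is a minimal sketch that parses the two log excerpts and summarizes them. The helper names (`parse_log`, `summarize`) are hypothetical and not part of CleanRL. It assumes a single environment, so the gap between consecutive `global_step` values roughly approximates the episode length.

```python
import re

# Matches both log styles seen above: with and without brackets around the return.
LOG_LINE = re.compile(r"global_step=(\d+), episodic_return=\[?(-?\d+(?:\.\d+)?)\]?")

RPO_LOG = """
global_step=48553, episodic_return=[91.50219]
global_step=48709, episodic_return=[91.06741]
global_step=48859, episodic_return=[91.755295]
global_step=48981, episodic_return=[93.888855]
"""

SAC_LOG = """
global_step=73925, episodic_return=-0.7892028093338013
global_step=74924, episodic_return=-0.7795553207397461
global_step=75923, episodic_return=-0.7974969744682312
global_step=76922, episodic_return=-0.8009135127067566
"""


def parse_log(text):
    """Return a list of (global_step, episodic_return) pairs found in the text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LOG_LINE.finditer(text)]


def summarize(records):
    """Return the mean return and the step gaps between consecutive episodes."""
    steps = [step for step, _ in records]
    returns = [ret for _, ret in records]
    gaps = [later - earlier for earlier, later in zip(steps, steps[1:])]
    return {"mean_return": sum(returns) / len(returns), "step_gaps": gaps}


for name, log in (("RPO", RPO_LOG), ("SAC", SAC_LOG)):
    print(name, summarize(parse_log(log)))
```

Under that single-environment assumption, the SAC episodes in this excerpt are all the same length, which would be consistent with every episode running until the time limit rather than reaching the goal.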

Checklist

[ ] I have installed dependencies via poetry install (see CleanRL's installation guideline).
[x] I have checked that there is no similar issue in the repo.
[x] I have checked the documentation site and found no relevant information in GitHub issues.

Current Behavior

In the "MountainCarContinuous-v0" environment, the SAC algorithm only converges to episodic returns of around 0. (The agent fails to complete the task.)

Expected Behavior

We expect SAC to also converge to episodic returns of around 100.

Possible Solution

I tried different hyperparameters and ran more episodes, but I could not get the expected results.

Steps to Reproduce

I hope you can give me some suggestions for fine-tuning the hyperparameters or updating the algorithm. Thanks a lot!

dosssman commented 10 months ago

Hello. Sorry for the late answer. I recall also having some difficulty getting SAC (and sometimes other algorithms) to converge on a supposedly trivial task such as MountainCar.

Just off the top of my head, one avenue worth exploring: encouraging more exploration with a higher --alpha (the entropy coefficient) could help overcome the local optimum.
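As a rough illustration of why a larger alpha means more exploration, here is a toy sketch, not CleanRL's implementation. It assumes a one-dimensional Gaussian policy N(0, sigma^2) with no tanh squashing and a made-up quadratic value Q(a) = -k * a^2 / 2. The entropy-regularized objective is then E[Q] + alpha * entropy, and its maximizer is sigma = sqrt(alpha / k), so the preferred action noise grows with alpha. The names `soft_objective` and `best_sigma` are hypothetical.

```python
import math


def soft_objective(sigma, alpha, k=1.0):
    """Expected toy Q-value plus alpha-weighted entropy for a N(0, sigma^2) policy."""
    expected_q = -0.5 * k * sigma ** 2  # E[-k * a^2 / 2] under N(0, sigma^2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)  # Gaussian entropy
    return expected_q + alpha * entropy


def best_sigma(alpha, k=1.0):
    """Grid-search the standard deviation that maximizes the toy objective."""
    grid = [i / 1000 for i in range(1, 3001)]  # sigma from 0.001 to 3.0
    return max(grid, key=lambda sigma: soft_objective(sigma, alpha, k))


for alpha in (0.05, 0.2, 0.8):
    # The grid result should sit close to the closed-form value sqrt(alpha / k).
    print(f"alpha={alpha}: grid sigma={best_sigma(alpha):.3f}, "
          f"closed form={math.sqrt(alpha):.3f}")
```

This only shows the direction of the effect in a simplified setting. Whether a higher alpha is enough to escape the local optimum on MountainCar still has to be checked experimentally.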

fr30 commented 10 months ago

Hey, I had a similar issue with discrete SAC and PPO performance when adapting them from training on Atari to solving Minigrid. I thought there must be some issue with the algorithm or the environment setup, but apparently I just had to spend some more time fine-tuning the parameters. Fortunately you don't have to do it by hand, as there are libraries that can handle that for you (like Optuna). I gotta admit though that SAC requires much more time to properly converge.
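For readers who want to see the shape of an automated search, here is a minimal standard-library sketch. It is plain random search, not Optuna, whose samplers are more sophisticated. The search space, the parameter names and ranges, and `placeholder_objective` are all assumptions for illustration; in practice the objective would launch a training run and return something like the mean episodic return.

```python
import math
import random

# Hypothetical search space: (low, high) bounds, sampled log-uniformly.
SEARCH_SPACE = {
    "alpha": (0.05, 1.0),
    "policy_lr": (1e-5, 1e-3),
    "q_lr": (1e-5, 1e-3),
}


def sample_params(rng):
    """Draw one configuration, log-uniform within each parameter's bounds."""
    return {
        name: math.exp(rng.uniform(math.log(low), math.log(high)))
        for name, (low, high) in SEARCH_SPACE.items()
    }


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_objective(params):
    """Stand-in score so the sketch runs; replace with a real training run."""
    return -abs(math.log10(params["alpha"]) + 0.5)


best_params, best_score = random_search(placeholder_objective, n_trials=50)
print(best_params, best_score)
```

Because each real trial is a full training run, the number of trials is usually limited by compute, which is one reason to prefer a library with smarter sampling and pruning over this bare loop.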

You could also check out Tips and tricks. Maybe that will help you to spot the issue.

Hope that helps!

mahaozhe commented 10 months ago

Thanks a lot for your comments, @fr30!

vwxyzjn / cleanrl