Closed: p-christ closed this issue 4 years ago
Anyone got any ideas on how to solve this?
Do you have a simple repro script? Or does it also happen for a default run, e.g. Pendulum-v0?
Ok, I just checked the PyTorch version and found no problems there. Let me know if you are on tf and I can check there as well. You can verify this yourself by setting a breakpoint around line ~339 of rllib/policy/torch_policy.py and stopping when the loop over the losses and their respective optimizers reaches the alpha-loss/alpha-optimizer pair. After the opt.step() call, the value of log_alpha has changed.
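For anyone who wants to run the same check outside a debugger, here is a minimal sketch. It assumes the Ray 0.8.x `SACTrainer` API with the `use_pytorch` flag (newer Ray versions use `framework="torch"` instead) and that the torch SAC model exposes `log_alpha` as a parameter, as in the snippet below:

```python
# Minimal sketch (assumptions: Ray 0.8.x SACTrainer API, torch SAC model
# exposing `log_alpha`; on newer Ray versions use config={"framework": "torch"}).
import ray
from ray.rllib.agents.sac import SACTrainer

ray.init()

trainer = SACTrainer(env="Pendulum-v0", config={"use_pytorch": True})
policy = trainer.get_policy()

# Value of log_alpha before any optimization step.
print("log_alpha before:", policy.model.log_alpha.item())

# One training iteration also runs the alpha-loss / alpha-optimizer update.
trainer.train()

# If the entropy coefficient is being optimized, this value should have moved.
print("log_alpha after:", policy.model.log_alpha.item())

ray.shutdown()
```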
Actually, I just checked tf2 (eager mode) as well, and it works there too:
# before the update in line ~360 (rllib/agents/sac/sac_tf_policy.py)
policy.model.log_alpha
<tf.Variable 'default_policy/log_alpha:0' shape=() dtype=float32, numpy=0.0>
# after the update
policy.model.log_alpha
<tf.Variable 'default_policy/log_alpha:0' shape=() dtype=float32, numpy=-0.00029999946>
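Alternatively, the coefficient can be watched over a normal training run via the learner stats returned by `trainer.train()`. The exact stats key (`alpha_value` below) is an assumption and may differ between RLlib versions; printing the stats dict once will show the available entries:

```python
# Sketch for watching the entropy coefficient over training iterations.
# Assumption: SAC's learner stats include an "alpha_value" entry; if the key
# differs in your RLlib version, print the stats dict and pick the right one.
import ray
from ray.rllib.agents.sac import SACTrainer

ray.init()
trainer = SACTrainer(env="Pendulum-v0")

for i in range(5):
    result = trainer.train()
    stats = result["info"]["learner"]["default_policy"]
    print(f"iter {i}: alpha_value={stats.get('alpha_value')}")

ray.shutdown()
```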
I will close this issue. Please let us know whether this is still not working on your end.
@p-christ
I ran SAC with the default config and found that the entropy term was not optimized; it just remained at its starting value throughout training.
Does anyone know what could be causing this? Is there a config value I need to set in order for the entropy to get optimized?
I was using Ray version 0.8.5.
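For reference, entropy tuning is enabled by default in RLlib's SAC; the entropy-related config entries look roughly like the sketch below (key names are taken from the Ray 0.8.x defaults and may differ in other versions, so check sac.DEFAULT_CONFIG for your install):

```python
# Entropy-related entries of RLlib's SAC config (Ray 0.8.x defaults assumed).
config = {
    # Initial value of the entropy coefficient alpha.
    "initial_alpha": 1.0,
    # "auto" sets the target entropy to -dim(action space) for continuous
    # actions; alpha is then optimized toward this target during training.
    "target_entropy": "auto",
    "optimization": {
        "actor_learning_rate": 3e-4,
        "critic_learning_rate": 3e-4,
        # Learning rate used by the alpha (entropy) optimizer.
        "entropy_learning_rate": 3e-4,
    },
}
```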