Closed: p-christ closed this issue 4 years ago
Anyone got any ideas on how to solve this?
Do you have a simple repro script? Or does it also happen for a default run, e.g. Pendulum-v0?
Ok, I just checked the PyTorch version and found no problems there. Let me know if you are on tf and I can check there as well. You can verify this yourself by setting a breakpoint around line ~339 of rllib/policy/torch_policy.py and stopping when the loop over the losses and their respective optimizers reaches the alpha-loss/alpha-optimizer pair. After the opt.step() call, the value of log_alpha has changed.
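For anyone who wants to run the same check outside a debugger, here is a minimal sketch. It assumes the Ray 0.8.x `SACTrainer` API with the `use_pytorch` flag (newer Ray versions use `framework="torch"` instead) and that the torch SAC model exposes `log_alpha` as a parameter, as in the snippet below:

```python
# Minimal sketch (assumptions: Ray 0.8.x SACTrainer API, torch SAC model
# exposing `log_alpha`; on newer Ray versions use config={"framework": "torch"}).
import ray
from ray.rllib.agents.sac import SACTrainer

ray.init()

trainer = SACTrainer(env="Pendulum-v0", config={"use_pytorch": True})
policy = trainer.get_policy()

# Value of log_alpha before any optimization step.
print("log_alpha before:", policy.model.log_alpha.item())

# One training iteration also runs the alpha-loss / alpha-optimizer update.
trainer.train()

# If the entropy coefficient is being optimized, this value should have moved.
print("log_alpha after:", policy.model.log_alpha.item())

ray.shutdown()
```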
Actually, I just checked tf2 (eager mode) as well, and it works there too:
# before the update in line ~360 (rllib/agents/sac/sac_tf_policy.py)
policy.model.log_alpha
<tf.Variable 'default_policy/log_alpha:0' shape=() dtype=float32, numpy=0.0>
# after the update
policy.model.log_alpha
<tf.Variable 'default_policy/log_alpha:0' shape=() dtype=float32, numpy=-0.00029999946>
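Alternatively, the coefficient can be watched over a normal training run via the learner stats returned by `trainer.train()`. The exact stats key (`alpha_value` below) is an assumption and may differ between RLlib versions; printing the stats dict once will show the available entries:

```python
# Sketch for watching the entropy coefficient over training iterations.
# Assumption: SAC's learner stats include an "alpha_value" entry; if the key
# differs in your RLlib version, print the stats dict and pick the right one.
import ray
from ray.rllib.agents.sac import SACTrainer

ray.init()
trainer = SACTrainer(env="Pendulum-v0")

for i in range(5):
    result = trainer.train()
    stats = result["info"]["learner"]["default_policy"]
    print(f"iter {i}: alpha_value={stats.get('alpha_value')}")

ray.shutdown()
```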
I will close this issue. Please let us know whether this is still not working on your end.
@p-christ
I ran SAC with the default config and found that the entropy term was not optimized; it just remained at its starting value throughout training.
Does anyone know what could be causing this? Is there a config value I need to set in order for the entropy to get optimized?
I was using Ray version 0.8.5.
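For reference, entropy tuning is enabled by default in RLlib's SAC; the entropy-related config entries look roughly like the sketch below (key names are taken from the Ray 0.8.x defaults and may differ in other versions, so check sac.DEFAULT_CONFIG for your install):

```python
# Entropy-related entries of RLlib's SAC config (Ray 0.8.x defaults assumed).
config = {
    # Initial value of the entropy coefficient alpha.
    "initial_alpha": 1.0,
    # "auto" sets the target entropy to -dim(action space) for continuous
    # actions; alpha is then optimized toward this target during training.
    "target_entropy": "auto",
    "optimization": {
        "actor_learning_rate": 3e-4,
        "critic_learning_rate": 3e-4,
        # Learning rate used by the alpha (entropy) optimizer.
        "entropy_learning_rate": 3e-4,
    },
}
```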