rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

SAC policy loss #73

Closed rmrafailov closed 5 years ago

rmrafailov commented 5 years ago

Based on the latest SAC code release the policy loss is given by:

policy_loss = (alpha*log_pi - q_new_actions).mean()

This makes sense as an empirical estimate of the KL projection loss, except for the alpha parameter. Why is that included? I thought this step just projects the policy onto the exp(Q) distribution, so why the alpha?
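
For reference, the projection step I have in mind (paraphrasing the policy improvement step from the original SAC paper) is

$$
\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \;\Big\|\; \frac{\exp\!\left(Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\text{old}}}(s_t)} \right),
$$

whose sample-based estimate would just be (log_pi - q_new_actions).mean(), with no alpha factor.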

Thank you.

vitchyr commented 5 years ago

See Section 5 of https://arxiv.org/pdf/1812.05905.pdf, which is the updated version of SAC referred to in the README. Basically, this converts "entropy-regularized RL" into "entropy-constrained RL", where the entropy term is weighted by alpha, an automatically tuned parameter that ensures the entropy stays above some threshold.
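
For anyone who finds this later, here is a minimal PyTorch sketch of how that automatic temperature tuning fits together with the policy loss quoted above. It assumes a standard SAC setup; the names log_alpha, target_entropy, log_pi, and q_new_actions are illustrative placeholders rather than rlkit's exact API.

```python
import torch

# Minimal sketch of the automatic temperature (alpha) tuning from Section 5 of
# https://arxiv.org/abs/1812.05905. Names like log_alpha, target_entropy, and
# q_new_actions are illustrative placeholders, not necessarily rlkit's exact API.

action_dim = 6
target_entropy = -float(action_dim)      # common heuristic: target entropy = -|A|

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

# Stand-ins for quantities computed on a minibatch by the policy and Q networks.
log_pi = torch.randn(256, 1)             # log pi(a|s) of actions sampled from the policy
q_new_actions = torch.randn(256, 1)      # Q(s, a) for those sampled actions

# Dual update on alpha: gradient descent on this loss increases alpha when the
# policy entropy (-log_pi on average) falls below target_entropy, and decreases
# it when the entropy is above the target.
alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()

alpha = log_alpha.exp().detach()

# Policy loss with the tuned temperature, matching the line quoted above.
policy_loss = (alpha * log_pi - q_new_actions).mean()
```

With this dual update, alpha grows when the policy's entropy drops below the target and shrinks when it is above it, so the entropy bonus in (alpha*log_pi - q_new_actions) is scaled automatically rather than fixed by hand.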