rail-berkeley / softlearning

Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.
https://sites.google.com/view/sac-and-applications

Question on the soft q learning implementation #143

Open YuxuanSong opened 4 years ago

YuxuanSong commented 4 years ago

Hi Haarnoja,

Thanks a lot for maintaining this amazing repo! I'm a little confused about the SVGD implementation in soft Q-learning. At https://github.com/rail-berkeley/softlearning/blob/05daa5524ae1a76b70b8a8a29a0f5f824d401484/softlearning/algorithms/sql.py#L281, the log probs are computed as log_probs = svgd_target_values + squash_correction, which is the log density over the $u$ (raw action) space, where $a = \tanh(u)$. However, the subsequent SVGD update uses these $u$-space log probs to compute update directions for $a$ (the squashed actions), which does not seem aligned.
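For concreteness, here is a rough sketch of the change of variables I have in mind (the function name and NumPy formulation are just illustrative, not the repo's API):

```python
import numpy as np

def u_space_log_density(q_values, u, eps=1e-6):
    """Convert an (unnormalized) log-density over a = tanh(u) into one over u.

    If log p_a(a) = Q(s, a) + const, then by change of variables
        log p_u(u) = Q(s, tanh(u)) + sum_i log(1 - tanh(u_i)^2) + const,
    which is what `svgd_target_values + squash_correction` appears to compute.
    """
    squash_correction = np.sum(np.log(1.0 - np.tanh(u) ** 2 + eps), axis=-1)
    return q_values + squash_correction
```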

I think it should be actions = self._policy.raw_actions(expanded_observations) at https://github.com/rail-berkeley/softlearning/blob/05daa5524ae1a76b70b8a8a29a0f5f824d401484/softlearning/algorithms/sql.py#L235 (the policy class could expose this property), as sketched below.
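To illustrate what I mean by "aligned": in a plain SVGD step, the particles and the gradient of the target log density have to be parameterized in the same space. A rough NumPy sketch (for illustration only, not how the repo structures it):

```python
import numpy as np

def svgd_step(x, grad_log_p, h=1.0, step_size=1e-2):
    """One SVGD update on particles x of shape (n, d).

    grad_log_p: (n, d), gradient of the target log-density evaluated at x.
    Both arguments must live in the same space; mixing u-space log probs
    with a-space particles (a = tanh(u)) gives inconsistent directions.
    """
    n = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]            # x_i - x_j, shape (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)           # (n, n)
    k = np.exp(-sq_dists / (2.0 * h ** 2))           # RBF kernel matrix
    driving = k @ grad_log_p                         # attraction toward high density
    repulsion = np.sum(diffs * k[..., None], axis=1) / h ** 2  # kernel-gradient term
    return x + step_size * (driving + repulsion) / n
```

With raw_actions, both the log probs and the particles would be in $u$-space, so the update directions would be consistent.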

Best, Yuxuan

hartikainen commented 4 years ago

Hey @YuxuanSong, thanks for bringing this up! The SQL implementation in this repo was migrated from https://github.com/haarnoja/softqlearning and I have actually not tested it thoroughly. I'll try to take a closer look at this soon and make sure it's implemented properly.