Closed: roosephu closed this issue 5 years ago.
self.policy.evaluate(next_state_batch)
policy_loss = -(expected_new_q_value).mean()
(same as the DDPG policy loss). This means that we will no longer require the regularization loss. (Although I have not given your question a lot of thought, these three points seemed very clear to me when I read the paper again today. I am very busy at the moment, at least this week, so if you can give me a week's time I might get back to you with a bit more information. Also, I have no idea why I made these mistakes -_- )
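For concreteness, here is a minimal sketch of what that corrected policy update could look like in PyTorch. The names (policy, critic, policy_optim, state_batch) and the return signature of evaluate are assumptions based on the snippets quoted above, not necessarily the repo's exact code:

# Sample an action from the policy at the *current* states,
# not next_state_batch; the return signature is assumed here.
new_action, log_prob, *_ = policy.evaluate(state_batch)

# Q-value of the freshly sampled (reparameterized) action.
expected_new_q_value = critic(state_batch, new_action)

# DDPG-style policy loss: maximize Q by minimizing its negation,
# with no extra regularization term.
policy_loss = -(expected_new_q_value).mean()

policy_optim.zero_grad()
policy_loss.backward()
policy_optim.step()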
https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L90 I've made some changes according to your query. Let me know if there is anything else that you think is wrong in the implementation.
Nice! Your code is really helpful, thanks!
https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L87
Should we use new_action here, or self.policy(next_state_batch)?
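For what it's worth, the two options differ in which state the action is sampled from. A rough sketch of the distinction, assuming the line in question feeds a soft value target and that evaluate returns the sampled action and its log-probability (both assumptions, not the repo's exact code):

# Option A: new_action is sampled from the policy at the *current*
# states, which matches the SAC soft value target
#   V(s) = E_{a ~ pi(.|s)} [ Q(s, a) - log pi(a|s) ]
new_action, log_prob, *_ = policy.evaluate(state_batch)
target_value = critic(state_batch, new_action) - log_prob

# Option B: self.policy(next_state_batch) would instead query the
# policy at the *next* states, bootstrapping through s' in the
# style of a DDPG target rather than the soft value update above.
next_action = policy(next_state_batch)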