pranv closed this issue 5 years ago.
Sorry, I think your implementation is correct given that the paper says:
We then use the minimum of the Q-functions for the value gradient in Equation 6 and policy gradient in Equation 1
I was misled by OpenAI Spinning Up's implementation here. I guess I'll open an issue there.
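For reference, here is a minimal sketch of the min-of-Q policy loss that the quoted passage describes; the names (`policy`, `critic1`, `critic2`, `alpha`, `state`) are placeholders for illustration, not the repo's actual variables:

```python
import torch

def sac_policy_loss_min_q(policy, critic1, critic2, alpha, state):
    # Reparameterized action sample and its log-probability under the current policy
    action, log_prob = policy.sample(state)
    # Evaluate both Q-functions at the freshly sampled action
    q1 = critic1(state, action)
    q2 = critic2(state, action)
    # Element-wise minimum of the two Q estimates (clipped double-Q)
    min_q = torch.min(q1, q2)
    # SAC policy objective: E[alpha * log_prob - min_q]
    return ((alpha * log_prob) - min_q).mean()
```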
Just to conclude: the authors themselves actually use q1, but they say it does not make much difference.
Thank you for the thorough evaluation, anyway.
But in the TD3 implementation, the author uses Q1 for the policy update.
In this line: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L125
should it not be this?
policy_loss = ((self.alpha * log_prob) - q1_new).mean()
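To illustrate what I mean, roughly (a sketch of the Q1-only variant with placeholder names such as `policy` and `critic1`, not the repository's actual code):

```python
import torch

def sac_policy_loss_q1_only(policy, critic1, alpha, state):
    # Sample an action from the current policy and get its log-probability
    action, log_prob = policy.sample(state)
    # Use only the first Q-function for the policy update, as TD3 does
    q1_new = critic1(state, action)
    return ((alpha * log_prob) - q1_new).mean()
```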