pranz24 / pytorch-soft-actor-critic

PyTorch implementation of Soft Actor-Critic

Policy Loss with Minimum or Q1? #3

Closed: pranv closed this issue 5 years ago

pranv commented 5 years ago

In line: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L125

Should it not be this instead? policy_loss = ((self.alpha * log_prob) - q1_new).mean()
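
For reference, here is a minimal sketch contrasting the two candidate policy losses. The tensor names (qf1_pi, qf2_pi, log_pi) and shapes are illustrative assumptions, not the repo's exact variables:

```python
import torch

batch = 256
qf1_pi = torch.randn(batch, 1)   # Q1(s, a) for actions a ~ pi(.|s)
qf2_pi = torch.randn(batch, 1)   # Q2(s, a) for actions a ~ pi(.|s)
log_pi = torch.randn(batch, 1)   # log pi(a|s) for the sampled actions
alpha = 0.2                      # entropy temperature

# Variant used in this repo: take the elementwise minimum over both critics.
min_qf_pi = torch.min(qf1_pi, qf2_pi)
policy_loss_min = ((alpha * log_pi) - min_qf_pi).mean()

# Variant proposed in this issue: use the first critic only.
policy_loss_q1 = ((alpha * log_pi) - qf1_pi).mean()
```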

pranv commented 5 years ago

Sorry, I think your implementation is correct given that the paper says:

We then use the minimum of the Q-functions for the value gradient in Equation 6 and policy gradient in Equation 1

I was misled by OpenAI Spinning Up's implementation here. I guess I'll open an issue there.

pranv commented 5 years ago

Just to conclude: the authors themselves actually use Q1, but they say it does not make much difference.

pranz24 commented 5 years ago

Anyway, thank you for the thorough evaluation.

wayunderfoot commented 4 years ago

But in the TD3 implementation, the author uses Q1 for the policy update.
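
For comparison, a minimal sketch of a TD3-style actor update; the module and variable names here are assumptions modeled on the original TD3 code, not copied from it. The policy gradient flows through Q1 alone, while the min over (Q1, Q2) is used only when forming the critic's target:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6  # illustrative dimensions

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        # Twin critic heads, kept simple for the sketch.
        self.q1 = nn.Linear(state_dim + action_dim, 1)
        self.q2 = nn.Linear(state_dim + action_dim, 1)

    def Q1(self, state, action):
        # Exposed separately because the actor update uses Q1 only.
        return self.q1(torch.cat([state, action], dim=1))

actor = nn.Linear(state_dim, action_dim)  # stand-in deterministic policy
critic = Critic()
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

state = torch.randn(256, state_dim)

# TD3-style actor loss: deterministic policy, no entropy term, Q1 only.
actor_loss = -critic.Q1(state, actor(state)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()
```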