pranv closed this issue 5 years ago.
Sorry, I think your implementation is correct given that the paper says:
We then use the minimum of the Q-functions for the value gradient in Equation 6 and policy gradient in Equation 1
I was misled by OpenAI Spinning Up's implementation here. I guess I'll open an issue there.
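For reference, here is a minimal sketch of the min-of-Q policy loss that the quoted passage describes; the names (`policy`, `critic1`, `critic2`, `alpha`, `state`) are placeholders for illustration, not the repo's actual variables:

```python
import torch

def sac_policy_loss_min_q(policy, critic1, critic2, alpha, state):
    # Reparameterized action sample and its log-probability under the current policy
    action, log_prob = policy.sample(state)
    # Evaluate both Q-functions at the freshly sampled action
    q1 = critic1(state, action)
    q2 = critic2(state, action)
    # Element-wise minimum of the two Q estimates (clipped double-Q)
    min_q = torch.min(q1, q2)
    # SAC policy objective: E[alpha * log_prob - min_q]
    return ((alpha * log_prob) - min_q).mean()
```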
Just to conclude: the authors themselves actually use q1, but they say it does not make much difference.
Thank you for the thorough evaluation, anyway.
But in the TD3 implementation, the author uses Q1 for the policy update.
In this line: https://github.com/pranz24/pytorch-soft-actor-critic/blob/master/sac.py#L125
should it not be this?
policy_loss = ((self.alpha * log_prob) - q1_new).mean()
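To illustrate what I mean, roughly (a sketch of the Q1-only variant with placeholder names such as `policy` and `critic1`, not the repository's actual code):

```python
import torch

def sac_policy_loss_q1_only(policy, critic1, alpha, state):
    # Sample an action from the current policy and get its log-probability
    action, log_prob = policy.sample(state)
    # Use only the first Q-function for the policy update, as TD3 does
    q1_new = critic1(state, action)
    return ((alpha * log_prob) - q1_new).mean()
```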