Open alexfrom0815 opened 4 years ago
Oh, I think I came up with a dumb question: (alpha * logp_pi - q_pi) = -(q_pi - alpha * logp_pi), so minimizing this loss is the same as running gradient ascent on (q_pi - alpha * logp_pi). I was too careless and left a meaningless question here. I am not sure whether I am able to delete this issue (I tried but could not find a way) or only the developers have that right. If I can delete it, please let me know, so this question does not waste other people's time.
I am a little confused by the implementation of Spinning Up's SAC. According to the Spinning Up tutorial, SAC runs gradient ascent on the policy to maximize (q_pi - alpha * logp_pi), but when I read the code, I find that compute_loss_pi is implemented like this:

```python
loss_pi = (alpha * logp_pi - q_pi).mean()
...
loss_pi.backward()
pi_optimizer.step()
```

I think this step minimizes the target rather than maximizing it, i.e. the opposite of what the tutorial describes. I am a newcomer to RL and don't know what I have missed; any advice would be appreciated!
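For anyone landing on this issue with the same confusion: the resolution is just the sign flip noted above. A minimal sketch (plain Python with a toy objective, not the actual spinningup code; `objective` here is a hypothetical stand-in for `q_pi - alpha * logp_pi`) showing that gradient *descent* on `loss = -objective` is exactly gradient *ascent* on the objective:

```python
def objective(x):
    # Toy stand-in for (q_pi - alpha * logp_pi); maximized at x = 3.
    return -(x - 3.0) ** 2

def grad_loss(x, eps=1e-6):
    # Numerical gradient of loss = -objective (what the optimizer descends).
    loss = lambda v: -objective(v)
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

x = 0.0
for _ in range(200):
    x -= 0.1 * grad_loss(x)  # descend on the loss ...

# ... which ascends the objective: x converges to the maximizer, 3.
print(round(x, 3))  # → 3.0
```

This is the standard trick when a library only exposes minimization (as PyTorch optimizers do): negate the objective, call it a loss, and minimize it.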