openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Why does SAC's policy run gradient descent? #241

Open alexfrom0815 opened 4 years ago

alexfrom0815 commented 4 years ago

I am a little confused by the implementation of Spinning Up's SAC. The tutorial says SAC runs gradient ascent on the policy to maximize (Q(s, a) − α log π(a|s)), but when I read the code, I find that `compute_loss_pi` is implemented like this:

```python
loss_pi = (alpha * logp_pi - q_pi).mean()
...
loss_pi.backward()
pi_optimizer.step()
```

I think this step minimizes the target (Q(a) − α log π(a)) rather than maximizing it. I am a newcomer to RL and don't know what I have missed; any advice would be appreciated!

alexfrom0815 commented 4 years ago

Oh, I think I asked a dumb question: `(alpha * logp_pi - q_pi) = -(q_pi - alpha * logp_pi)`, so minimizing this loss with gradient descent is the same as running gradient ascent on the objective. I was too careless and left a meaningless question here. I am not sure whether I have the right to delete this issue (I tried but couldn't find a way) or only the developers do. If I can delete it, please let me know, so this question doesn't waste other readers' time.
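For anyone landing here with the same confusion: the sign flip can be checked numerically with a toy scalar example. This is just an illustrative sketch, not the Spinning Up code; `q` and `logp` below are made-up stand-ins for Q(s, π_θ(s)) and log π_θ(a|s). A descent step on `loss = alpha * logp - q` produces exactly the same parameter update as an ascent step on `objective = q - alpha * logp`:

```python
# Toy demonstration (hypothetical functions, not Spinning Up's SAC):
# minimizing loss = alpha * logp - q  is identical to
# maximizing objective = q - alpha * logp, because loss = -objective.

def grad(f, x, eps=1e-6):
    """Central-difference numerical gradient of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

alpha = 0.2

def q(theta):
    # stand-in for Q(s, pi_theta(s)): peaked at theta = 3
    return -(theta - 3.0) ** 2

def logp(theta):
    # stand-in for log pi_theta(a|s)
    return -0.5 * theta ** 2

def loss(theta):
    # what compute_loss_pi minimizes
    return alpha * logp(theta) - q(theta)

def objective(theta):
    # what SAC wants to maximize
    return q(theta) - alpha * logp(theta)

theta, lr = 0.0, 0.1
descent_step = theta - lr * grad(loss, theta)      # gradient descent on the loss
ascent_step = theta + lr * grad(objective, theta)  # gradient ascent on the objective

print(descent_step, ascent_step)  # the two updates coincide
```

Both updates move θ in the same direction by the same amount, which is exactly why optimizers that only minimize (like `pi_optimizer` here) are fed the negated objective.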