Open alexfrom0815 opened 4 years ago
Oh, I think I came up with a dumb question: (alpha * logp_pi - q_pi) = -(q_pi - alpha * logp_pi), so minimizing this loss is the same as running gradient ascent on (q_pi - alpha * logp_pi). I was too careless and left a meaningless question here. I am not sure whether I am able to delete this issue (I tried but could not find a way) or only the developers have that right. If I can delete it, please let me know, so this question does not waste other people's time.
I am a little confused by the implementation of Spinning Up's SAC. According to the Spinning Up tutorial, SAC runs gradient ascent on the policy to maximize (q_pi - alpha * logp_pi), but when I read the code, I find that compute_loss_pi is implemented like this:

```python
loss_pi = (alpha * logp_pi - q_pi).mean()
...
loss_pi.backward()
pi_optimizer.step()
```

I think this step minimizes the target rather than maximizing it, i.e. the opposite of what the tutorial describes. I am a newcomer to RL and don't know what I have missed; any advice would be appreciated!
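For anyone landing on this issue with the same confusion: the resolution is just the sign flip noted above. A minimal sketch (plain Python with a toy objective, not the actual spinningup code; `objective` here is a hypothetical stand-in for `q_pi - alpha * logp_pi`) showing that gradient *descent* on `loss = -objective` is exactly gradient *ascent* on the objective:

```python
def objective(x):
    # Toy stand-in for (q_pi - alpha * logp_pi); maximized at x = 3.
    return -(x - 3.0) ** 2

def grad_loss(x, eps=1e-6):
    # Numerical gradient of loss = -objective (what the optimizer descends).
    loss = lambda v: -objective(v)
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

x = 0.0
for _ in range(200):
    x -= 0.1 * grad_loss(x)  # descend on the loss ...

# ... which ascends the objective: x converges to the maximizer, 3.
print(round(x, 3))  # → 3.0
```

This is the standard trick when a library only exposes minimization (as PyTorch optimizers do): negate the objective, call it a loss, and minimize it.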