pranz24 / pytorch-soft-actor-critic

PyTorch implementation of soft actor critic
MIT License

Target value calculation mistake #25

Closed alirezakazemipour closed 4 years ago

alirezakazemipour commented 4 years ago

Hi. I guess there is a mistake in the target value, which has been written as:

vf_target = min_qf_pi - (self.alpha * log_pi)

that is:

pi, log_pi, mean, log_std = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)

But I believe min_qf_pi should be:

qf1_pi, qf2_pi = self.critic(state_batch, action_batch)
min_qf_pi = torch.min(qf1_pi, qf2_pi)

pranz24 commented 4 years ago

No, we take actions from the policy. [screenshot: the value-function target equation from the SAC paper, in which the action is sampled from the current policy]
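For reference, here is a minimal sketch of how that value-function target can be computed in PyTorch, assuming the interfaces quoted above (policy.sample and a twin-Q critic) plus a separate value network and an MSE regression loss, which are assumptions about the SAC_V variant rather than a quote of the repository's code. The key point is that the actions fed to the Q-networks are freshly sampled from the current policy; the replay-buffer actions are not used for this target:

```python
import torch
import torch.nn.functional as F

def value_loss_sketch(policy, critic, value, state_batch, alpha):
    # Resample actions for the batch states from the CURRENT policy;
    # the stored replay-buffer actions are not used for the value target.
    pi, log_pi, _, _ = policy.sample(state_batch)

    # Clipped double-Q: evaluate both critics at the sampled actions
    # and take the element-wise minimum.
    qf1_pi, qf2_pi = critic(state_batch, pi)
    min_qf_pi = torch.min(qf1_pi, qf2_pi)

    # Soft value target: Q(s, a~pi) minus the (temperature-scaled) log-prob.
    vf_target = min_qf_pi - (alpha * log_pi)

    # Regress V(s) toward the detached target so gradients only flow
    # through the value network.
    vf = value(state_batch)
    return F.mse_loss(vf, vf_target.detach())
```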

alirezakazemipour commented 4 years ago

Thank you for your reply. Yeah, you are right; I was confused because the notation for reparameterized actions in the paper is f(epsilon_t; s_t) when we take actions from the policy. Anyway, you're completely right. My last question is: the formula you posted has no alpha (temperature parameter) in it, but you have used it in your code?🤔
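As a side note, here is a minimal sketch of what the reparameterized sampling f(epsilon_t; s_t) mentioned above usually looks like for a tanh-squashed Gaussian policy; the names and the 1e-6 constant are illustrative, not taken from this repository:

```python
import torch
from torch.distributions import Normal

def reparameterized_sample(mean, log_std, eps=1e-6):
    """a_t = f(epsilon_t; s_t): the action is a deterministic function of the
    state-dependent (mean, log_std) and an independent noise sample."""
    std = log_std.exp()
    normal = Normal(mean, std)
    x_t = normal.rsample()       # mean + std * epsilon, with epsilon ~ N(0, I)
    action = torch.tanh(x_t)     # squash the action into (-1, 1)
    # Log-probability with the change-of-variables correction for the tanh squash.
    log_prob = normal.log_prob(x_t) - torch.log(1 - action.pow(2) + eps)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    return action, log_prob
```

The rsample() call is what makes the sample differentiable with respect to mean and log_std, which is what the f(epsilon_t; s_t) notation in the paper refers to.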

pranz24 commented 4 years ago

This will give you a better idea: https://spinningup.openai.com/en/latest/algorithms/sac.html

alirezakazemipour commented 4 years ago

That link explains this paper: https://arxiv.org/abs/1812.05905, which uses alpha and corresponds to your master branch. My question was about your SAC_V branch, which implements https://arxiv.org/abs/1801.01290. But I figured it out: there is no alpha in the pseudocode because alpha = 1. I think you should change its value from 0.2 to 1 on your SAC_V branch.
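To make the relation explicit, the soft value target with the temperature written out is shown below; setting alpha = 1 recovers the form used in the original paper's pseudocode. This is a restatement of the quantities already discussed in this thread, not a quote from either paper:

```latex
% Soft value target with an explicit temperature (the min over the two
% critics is the clipped double-Q trick used in the code above):
V_{\text{target}}(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[\min_{i=1,2} Q_i(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right]
% With alpha = 1 this reduces to
%   E_{a_t ~ pi}[ min_i Q_i(s_t, a_t) - log pi(a_t | s_t) ],
% the form that appears in https://arxiv.org/abs/1801.01290.
```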