alirezakazemipour closed this issue 4 years ago
No, we take actions from the policy.
Thank you for your reply. Yeah, you are right; I was confused because the notation for reparameterized actions in the paper is f(epsilon_t; s_t) when we take actions from the policy. But anyway, you're completely right. My last question is: based on the formula you posted, there is no alpha (temperature parameter) in the equation, but you have utilized it in your code? 🤔
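For context, here is what I understand f(epsilon_t; s_t) to look like in a typical PyTorch SAC implementation, i.e. a reparameterized Gaussian sample squashed with tanh. This is a minimal sketch; the function name and the 1e-6 constant are illustrative, not taken from this repo:

```python
import torch
from torch.distributions import Normal

def reparameterized_sample(mean, log_std):
    """Sketch of f(epsilon_t; s_t): a_t = tanh(mu + sigma * epsilon),
    with epsilon ~ N(0, I). Names here are illustrative only."""
    std = log_std.exp()
    normal = Normal(mean, std)
    u = normal.rsample()                    # mu + sigma * epsilon (differentiable)
    action = torch.tanh(u)                  # squash into [-1, 1]
    # log-prob with the tanh change-of-variables correction
    log_pi = normal.log_prob(u) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_pi.sum(dim=-1, keepdim=True)
```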
This will give you a better idea: https://spinningup.openai.com/en/latest/algorithms/sac.html
That link explains this paper: https://arxiv.org/abs/1812.05905, which uses alpha and corresponds to your master branch. My question was about your SAC_V branch, which implements https://arxiv.org/abs/1801.01290. But I figured it out: there is no alpha in the pseudocode because alpha = 1. I think you should change its value from 0.2 to 1 on your SAC_V branch.
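To spell that out, the value target with the temperature written explicitly would read as follows (my transcription, combining Eq. 3 of https://arxiv.org/abs/1801.01290 with the twin-Q minimum the paper uses for the value gradient; the paper's pseudocode simply omits alpha):

```latex
V(s_t) \;=\; \mathbb{E}_{a_t \sim \pi}\!\Big[\min_{i=1,2} Q_{\theta_i}(s_t, a_t) \;-\; \alpha \log \pi(a_t \mid s_t)\Big],
\qquad \text{with } \alpha = 1 \text{ in the paper's pseudocode.}
```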
Hi. I guess there is a mistake in the target value, which has been written as:

```python
vf_target = min_qf_pi - (self.alpha * log_pi)
```
that is:

```python
pi, log_pi, mean, log_std = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
```
But I believe min_qf_pi should be:

```python
qf1_pi, qf2_pi = self.critic(state_batch, action_batch)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
```
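For reference, here is the value target as Eq. 3 of https://arxiv.org/abs/1801.01290 defines it, written as a self-contained sketch so both variants above can be checked against it. The module names (value_net, critic, policy) and the no-grad wrapping are my assumptions, not necessarily this repo's code; note that in the paper the expectation is over fresh actions sampled from the current policy:

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, critic, policy, state_batch, alpha=1.0):
    """Soft value loss sketch: V(s) is regressed toward
    E_{a ~ pi}[min(Q1, Q2) - alpha * log pi(a|s)]  (Eq. 3, alpha = 1 in the paper).
    All module names here are placeholders for illustration."""
    with torch.no_grad():
        pi, log_pi, _, _ = policy.sample(state_batch)  # actions from the current policy
        qf1_pi, qf2_pi = critic(state_batch, pi)
        min_qf_pi = torch.min(qf1_pi, qf2_pi)
        vf_target = min_qf_pi - alpha * log_pi         # regression target for V
    vf = value_net(state_batch)
    return F.mse_loss(vf, vf_target)
```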