Closed: zsano1 closed this issue 4 years ago
Hi,
This is the log derivative trick for estimating the gradient of the expected returns. Here is a good explanation.
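For context, a rough sketch of the identity being referred to (writing $Q$ for whatever return or advantage estimate multiplies the log-probability; the exact target used in the code is not shown here):

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q(s, a) \big]
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N} Q(s, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s),
  \qquad a_i \sim \pi_\theta(\cdot \mid s).
```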
Hi, thanks for your reply! But `log_pi` here is the log of one selected action, right? In my opinion, we need `torch.log(probs)`, which is the log of the policy distribution.
The purpose of the log is to account for sampling from the policy distribution (which estimates an expectation), so we only need the `log_pi` of the sampled action. If we wanted to use the probabilities of each action, we would also need all of their Q-values, and the method would resemble Mean Actor-Critic.
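To make the contrast concrete, here is a minimal PyTorch sketch (not the MAAC code; `probs`, `q_all`, and `actions` are hypothetical tensors):

```python
import torch

def sampled_logprob_loss(probs, q_all, actions):
    """Score-function estimator: only the sampled action's log-prob is needed."""
    # probs:   (batch, n_actions) policy probabilities
    # q_all:   (batch, n_actions) critic values (only the sampled column is used)
    # actions: (batch,) indices of the actions that were actually sampled
    log_pi = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_pi * q_taken.detach()).mean()

def mean_actor_critic_loss(probs, q_all):
    """Mean Actor-Critic style: explicit expectation over all actions, so every Q-value is needed."""
    return -(probs * q_all.detach()).sum(dim=1).mean()
```

In the first form, the probabilities of unsampled actions never enter the loss, which is why the `log_pi` of the sampled action is enough.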
Oh yes, you are right. I was assuming your `q` here was all actions' values. Thanks for your explanation.
Hi Shariq, first of all, thank you for your code! It works well. But when optimizing the policy, shouldn't it be `probs * (-pol_target)`? Why do we use `log_pi` here?
https://github.com/shariqiqbal2810/MAAC/blob/bd263afce709795293964badd16655b5747b9056/algorithms/attention_sac.py#L150