Closed: zsano1 closed this issue 4 years ago
Hi,
This is the log derivative trick for estimating the gradient of the expected returns. Here is a good explanation.
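For context, a rough sketch of the identity being referred to (writing $Q$ for whatever return or advantage estimate multiplies the log-probability; the exact target used in the code is not shown here):

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q(s, a) \big]
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N} Q(s, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s),
  \qquad a_i \sim \pi_\theta(\cdot \mid s).
```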
Hi, thanks for your reply! But `log_pi` here is the log of one selected action, right? In my opinion, we need `torch.log(probs)`, which is the log of the policy distribution.
The purpose of the log is to account for sampling from the policy distribution (which estimates an expectation), so we only need the `log_pi` of the sampled action. If we wanted to use the probabilities of each action, we would also need all of their Q-values, and the method would resemble Mean Actor-Critic.
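To make the contrast concrete, here is a minimal PyTorch sketch (not the MAAC code; `probs`, `q_all`, and `actions` are hypothetical tensors):

```python
import torch

def sampled_logprob_loss(probs, q_all, actions):
    """Score-function estimator: only the sampled action's log-prob is needed."""
    # probs:   (batch, n_actions) policy probabilities
    # q_all:   (batch, n_actions) critic values (only the sampled column is used)
    # actions: (batch,) indices of the actions that were actually sampled
    log_pi = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(log_pi * q_taken.detach()).mean()

def mean_actor_critic_loss(probs, q_all):
    """Mean Actor-Critic style: explicit expectation over all actions, so every Q-value is needed."""
    return -(probs * q_all.detach()).sum(dim=1).mean()
```

In the first form, the probabilities of unsampled actions never enter the loss, which is why the `log_pi` of the sampled action is enough.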
Oh yes, you are right. I was assuming your `q` here was all actions' values. Thanks for your explanation.
Hi Shariq, first of all, thank you for your code! It works well. But when optimizing the policy, shouldn't it be `probs * (-pol_target)`? Why do we use `log_pi` here?
https://github.com/shariqiqbal2810/MAAC/blob/bd263afce709795293964badd16655b5747b9056/algorithms/attention_sac.py#L150