openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"
https://arxiv.org/pdf/1706.02275.pdf
MIT License

How or why does the Gaussian distribution contribute to the training? #18

Open Chen-Joe-ZY opened 6 years ago

Chen-Joe-ZY commented 6 years ago

It's interesting that the code decomposes the output of the actor network into a mean and a standard deviation, and then constructs a new action from a Gaussian distribution. In the past there was usually an extra noise factor that decayed gradually to control the amount of added noise. I wonder if you can explain how or why this works :)
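For context, a minimal NumPy sketch of the mechanism the question describes (function and variable names here are hypothetical, not the repo's code): the actor's flat output is split into a mean and a log standard deviation, and the action is built by the reparameterization a = mean + exp(log_std) * noise.

```python
import numpy as np

def gaussian_action(actor_output, rng=np.random.default_rng()):
    # Split the actor's flat output: first half is the mean, second half the log std.
    mean, log_std = np.split(actor_output, 2)
    std = np.exp(log_std)                    # exponentiate to keep the std positive
    noise = rng.standard_normal(mean.shape)  # parameter-free standard Gaussian noise
    return mean + std * noise                # action = mean shifted/scaled by the noise

# Example: an actor output of size 4 yields a 2-dimensional action.
print(gaussian_action(np.array([0.1, -0.3, -1.0, -1.0])))
```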

PBarde commented 6 years ago

I would also be interested in an answer. As a matter of fact, I have difficulty understanding how this works: from my understanding, when doing the actor update we compute the gradient of the Q value with respect to an action sampled from the Gaussian distribution (which is thus stochastic), yet we apply the deterministic policy gradient. How can that be? Thanks
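One possible resolution (my reading, not an authoritative answer from the maintainers): because the sample is written as mean + std * noise with parameter-free noise (the reparameterization trick), the sampled action is a differentiable function of the actor's parameters, so the gradient of the Q value with respect to that action still propagates back into the mean and log std, much like a DDPG-style update. A small PyTorch analogue (the repo itself uses TensorFlow; the network names here are made up):

```python
import torch

# Toy shapes: 8-dim observation, 2-dim action; the actor emits mean and log_std.
obs = torch.randn(1, 8)
actor = torch.nn.Linear(8, 4)               # hypothetical actor head: [mean(2), log_std(2)]
critic = torch.nn.Linear(8 + 2, 1)          # hypothetical critic over (obs, action)

mean, log_std = actor(obs).chunk(2, dim=-1)
noise = torch.randn_like(mean)              # parameter-free noise, no gradient of its own
action = mean + log_std.exp() * noise       # reparameterized sample: a function of actor params

q = critic(torch.cat([obs, action], dim=-1))
loss = -q.mean()                            # DDPG-style actor loss: maximize Q of the sampled action
loss.backward()

print(actor.weight.grad.abs().sum() > 0)    # tensor(True): dQ/da flows back into mean and log_std
```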

GoingMyWay commented 5 years ago

> I would also be interested in an answer. As a matter of fact, I have difficulty understanding how this works: from my understanding, when doing the actor update we compute the gradient of the Q value with respect to an action sampled from the Gaussian distribution (which is thus stochastic), yet we apply the deterministic policy gradient. How can that be? Thanks

I could not agree more; the code samples actions from a distribution, which contradicts the policy's deterministic property.

suiguoxin commented 5 years ago

> I would also be interested in an answer. As a matter of fact, I have difficulty understanding how this works: from my understanding, when doing the actor update we compute the gradient of the Q value with respect to an action sampled from the Gaussian distribution (which is thus stochastic), yet we apply the deterministic policy gradient. How can that be? Thanks
>
> I could not agree more; the code samples actions from a distribution, which contradicts the policy's deterministic property.

I have the same problem. Any progress since? Many thanks.

EastVolcano commented 4 years ago

I think the Gaussian distribution is used to explore the action space. The action is not sampled from some fixed Gaussian; rather, its value is determined by a Gaussian whose parameters come from the network. The stddev, which is an output of the DNN, determines the scope of exploration. These are my personal views; corrections are welcome.
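To illustrate that last point numerically: when the network's stddev output shrinks, the sampled actions concentrate around the mean, so the learned std directly sets the exploration scope. A tiny NumPy sketch (values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([0.5, -0.2])       # an arbitrary action mean

# As the (learned) std shrinks, samples cluster ever more tightly around the mean.
for std in (1.0, 0.1, 0.01):
    samples = mean + std * rng.standard_normal((1000, 2))
    print(f"std={std:<5} empirical spread = {samples.std(axis=0).round(3)}")
```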