openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"
https://arxiv.org/pdf/1706.02275.pdf
MIT License

action exploration & Gumbel-Softmax #9

Open djbitbyte opened 6 years ago

djbitbyte commented 6 years ago

Hello, I have questions on exploration and Gumbel-Softmax.

In the pseudocode, it mentions initializing a random process N for action exploration, the same as in the DDPG paper. But I have difficulty understanding the exploration in your implementation. Is the Ornstein-Uhlenbeck process used for this algorithm, as in DDPG? Could you explain how you handled action exploration?

Another question: did you use softmax instead of Gumbel-Softmax?

I have tried to implement MADDPG on the simple-speaker-listener scenario, but without the Ornstein-Uhlenbeck process for action exploration and with only a softmax at the actor network output. The other parts are the same as in the paper, but my speaker converges to announcing the same wrong target landmark, and the listener wanders around or between the three landmarks. I guess the listener learned to ignore the speaker, as described in the paper. I've also tried your code on simple-speaker-listener, and it converges correctly for some training runs. Are the action exploration and activation functions the reasons for the wrong convergence? Do they have a big impact on the training process?

Thanks for your time!

pengzhenghao commented 6 years ago

I think in this implementation they use softmax as the output activation function when sampling actions. Looking at the code below, you can see that they attempted to use a hard (argmax) categorical activation via return CategoricalPdType(ac_space.n) when sampling, but that line is commented out, and eventually they use the softmax activation when training the Q net.

def make_pdtype(ac_space):
    # Map a Gym action space to a probability-distribution type.
    from gym import spaces
    if isinstance(ac_space, spaces.Box):
        # Continuous actions: diagonal Gaussian over each dimension.
        assert len(ac_space.shape) == 1
        return DiagGaussianPdType(ac_space.shape[0])
    elif isinstance(ac_space, spaces.Discrete):
        # The hard (argmax) categorical was tried but left commented out;
        # the soft, differentiable version is used instead.
        # return CategoricalPdType(ac_space.n)
        return SoftCategoricalPdType(ac_space.n)
    elif isinstance(ac_space, spaces.MultiDiscrete):
        #return MultiCategoricalPdType(ac_space.low, ac_space.high)
        return SoftMultiCategoricalPdType(ac_space.low, ac_space.high)
    elif isinstance(ac_space, spaces.MultiBinary):
        return BernoulliPdType(ac_space.n)
    else:
        raise NotImplementedError
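
For context, here is a minimal sketch of how a pdtype built by make_pdtype is typically turned into a sampling op. The pdfromflat/sample method names appear to follow the baselines-style distributions module this code is adapted from; treat this as an illustration under those assumptions, not verbatim repo code.

# Hedged sketch, not verbatim repo code: wiring make_pdtype (defined
# above) into a sampling op. Assumes a baselines-style PdType API
# (pdfromflat / sample) and TF1 graph mode.
import tensorflow as tf
from gym import spaces

ac_space = spaces.Discrete(5)
pdtype = make_pdtype(ac_space)          # -> SoftCategoricalPdType(5)

logits = tf.placeholder(tf.float32, [None, ac_space.n])   # actor output
pd = pdtype.pdfromflat(logits)          # wrap logits in a distribution
action_op = pd.sample()                 # softmax(logits + Gumbel noise)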
djbitbyte commented 6 years ago

Hello, @PengZhenghao!

I've looked into the functions involved again. I guess they use SoftCategoricalPdType(ac_space.n), then SoftCategoricalPdType.sample() to add noise to the actions, and finally softmax(logits - noise) as the output of the actor network.

And the noise added to the action comes from:

def sample(self):
    u = tf.random_uniform(tf.shape(self.logits))
    return U.softmax(self.logits - tf.log(-tf.log(u)), axis=-1)

I don't quite get why they handle the noise in this way.
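
(Note for readers: with u ~ Uniform(0,1), -log(-log(u)) is exactly a Gumbel(0,1) sample, so adding it to the logits and taking the argmax draws from the categorical distribution exactly; this is the Gumbel-max trick, and the softmax above is its smooth relaxation. A quick NumPy check, independent of this repo:)

# Quick NumPy check that argmax(logits + Gumbel noise) samples the
# categorical distribution exactly (the "Gumbel-max trick").
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.0, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()    # target probabilities

n = 100_000
u = rng.uniform(size=(n, logits.size))
gumbel = -np.log(-np.log(u))                     # Gumbel(0, 1) samples
counts = np.bincount(np.argmax(logits + gumbel, axis=1), minlength=3)

print(probs)        # ~ [0.665, 0.245, 0.090]
print(counts / n)   # empirical frequencies, close to probs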

djbitbyte commented 6 years ago

The sample function in the distribution is an implementation of Gumbel-Softmax. I added it to my code, and it now helps to speed up and stabilize training, but my speaker still cannot tell the different landmarks apart.

How do you handle the action exploration then?

LiuQiangOpenMind commented 5 years ago

Hello, @PengZhenghao! I don't quite get why they handle the noise in the form of the log-log link function. Since the log-log transform is non-linear, the noise generated each time can fluctuate. How do you control the degree of noise to ensure adequate action exploration?

pengzhenghao commented 5 years ago

> Hello, @PengZhenghao! I don't quite get why they handle the noise in the form of the log-log link function. Since the log-log transform is non-linear, the noise generated each time can fluctuate. How do you control the degree of noise to ensure adequate action exploration?

The Gumbel-Softmax trick is an important re-parameterization trick that helps smooth backpropagation. I refer you to search with the keyword "gumbel softmax" for more information. I am sorry for not providing more detail, since I do not thoroughly understand the whole process of Gumbel-Softmax...
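
(To make that concrete, here is a minimal, generic sketch of the relaxed sampler; it is not code from this repo. The temperature tau is the knob that controls how far the relaxed sample is from a hard one-hot, which relates to the question above about controlling the noise.)

# Generic Gumbel-Softmax sketch (illustration only, not repo code).
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng()):
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    y = (logits + g) / tau                # temperature scales the sharpness
    y = y - y.max()                       # for numerical stability
    return np.exp(y) / np.exp(y).sum()    # softmax -> relaxed one-hot

logits = np.array([2.0, 1.0, 0.0])
print(gumbel_softmax(logits, tau=0.5))    # close to a one-hot vector
print(gumbel_softmax(logits, tau=5.0))    # much smoother, near uniform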

Ah31 commented 5 years ago

Hello @djbitbyte!

You said that Gumbel-Softmax helps to speed up and stabilize training. I am trying to reproduce the results in PyTorch, using torch.nn.functional._gumbel_softmax_sample when sampling the action for the current state (code screenshot not shown). I am also using torch.nn.functional.gumbel_softmax to compute target actions for the next states and to compute the current agent's action to be fed into actor_local. Based on the original code and the algorithm, I am not able to understand why training does not converge once I use gumbel_softmax.

Thanks in advance!
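
(For comparison, here is a minimal PyTorch sketch of the pattern commonly seen in MADDPG ports; actor, obs, and the sizes below are placeholders, not names from any particular codebase.)

# Hedged PyTorch sketch of a common Gumbel-Softmax pattern in MADDPG
# ports (illustration only; `actor` and `obs` are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

n_obs, n_actions = 8, 5
actor = nn.Linear(n_obs, n_actions)     # stand-in for the actor network
obs = torch.randn(1, n_obs)

logits = actor(obs)

# Action fed to the environment / critic: hard one-hot sample with
# straight-through gradients, so the actor loss can backprop through it.
action = F.gumbel_softmax(logits, tau=1.0, hard=True)

# Target actions for the critic update are usually computed the same way
# from the target actor's logits (here reusing `logits` detached, purely
# for illustration).
target_action = F.gumbel_softmax(logits.detach(), tau=1.0, hard=False)

print(action, target_action)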

Ah31 commented 4 years ago

Hello! Just to mention that there were many other issues in my code besides the Gumbel-Softmax that were causing training not to converge.

kargarisaac commented 4 years ago

> Hello! Just to mention that there were many other issues in my code besides the Gumbel-Softmax that were causing training not to converge.

Hi, I'm trying to understand how to use gumbel_softmax in PyTorch to reproduce the results. I'm using PPO, but it cannot fully learn the task even with only one agent and one landmark. It reaches a reasonable level, but it's not nearly as good as MADDPG. I think the problem is the plain softmax and Categorical distribution I use, and I want to change it to Gumbel-Softmax. I used:

policy_dist = distributions.Categorical(F.gumbel_softmax(policy_logits_out, tau=1, hard=False).to("cpu"))

But I didn't get good results. There is also a torch.distributions.Gumbel in PyTorch. I think I'm using them incorrectly.

Can you provide an example to use them in your own algorithm?

Thank you
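
(For reference, a minimal sketch contrasting the two samplers, assuming policy_logits_out is the raw logits tensor. Note that wrapping a gumbel_softmax output in a Categorical, as in the line above, samples twice: once via the Gumbel noise and once via the Categorical.)

# Hedged sketch contrasting the two samplers (illustration only;
# `policy_logits_out` is a placeholder logits tensor).
import torch
import torch.nn.functional as F
from torch import distributions

policy_logits_out = torch.randn(1, 5, requires_grad=True)

# Score-function style (PPO-like): build the Categorical directly on the
# logits and use log_prob in the surrogate loss; no Gumbel noise needed.
dist = distributions.Categorical(logits=policy_logits_out)
action = dist.sample()
log_prob = dist.log_prob(action)

# Pathwise style (MADDPG-like): a differentiable (straight-through)
# one-hot action that can be fed directly into a critic; no extra
# Categorical wrapper on top of the relaxed sample.
action_onehot = F.gumbel_softmax(policy_logits_out, tau=1.0, hard=True)

print(action, log_prob, action_onehot)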

tanxiangtj commented 4 years ago

> The sample function in the distribution is an implementation of Gumbel-Softmax. I added it to my code, and it now helps to speed up and stabilize training, but my speaker still cannot tell the different landmarks apart.
>
> How do you handle the action exploration then?

Can you provide the code for your implementation of Gumbel-Softmax? I've met the same problem when using MADDPG. Many thanks.