
[RLlib] About the implementation of maddpg #9638

Closed raphaelavalos closed 3 years ago

raphaelavalos commented 4 years ago

Hello, I was looking at the implementation of MADDPG and I don't understand why the last line is here. To me there shouldn't be any sampling. Am I missing something?

https://github.com/ray-project/ray/blob/bfa06052828f83ab790e6b6bbfa5b56edb42b45e/rllib/contrib/maddpg/maddpg_policy.py#L370-L376

@justinkterry I believe you are maintaining this code, could you take a look? Thanks!

jkterry1 commented 4 years ago

If memory serves, it's because MADDPG is a continuous-action method set up to work with discrete environments (the MPE environments). The problem is that MADDPG doesn't actually work in production, or really for anything other than those environments, even if you fix everything. Try parameter sharing instead; it's simpler and performs dramatically better. This is a paper I have under review at NeurIPS; it links to RLlib parameter-sharing code you can use: https://arxiv.org/abs/2005.13625.
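
For concreteness, a minimal sketch of what parameter sharing looks like in an RLlib config: every agent ID maps to one shared policy, so all agents train the same network. The env name and observation/action spaces below are placeholders, and the multiagent config keys should be checked against your RLlib version.

    import ray
    from ray import tune
    from gym.spaces import Box, Discrete

    # Placeholder spaces -- in practice take these from your multi-agent env.
    obs_space = Box(low=-1.0, high=1.0, shape=(10,))
    act_space = Discrete(5)

    config = {
        "env": "my_multi_agent_env",  # hypothetical registered env name
        "multiagent": {
            # One policy shared by every agent: this is parameter sharing.
            "policies": {"shared": (None, obs_space, act_space, {})},
            "policy_mapping_fn": lambda agent_id: "shared",
        },
    }

    ray.init()
    tune.run("PPO", config=config, stop={"timesteps_total": 1_000_000})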

raphaelavalos commented 4 years ago

Thank you for your quick response! I will add your paper to my list :)


What I don't understand is that RelaxedOneHotCategorical returns a vector (that matches a probability distribution) of the same shape as the logits. That sample is then used directly as the action, because even though env.action_space is Tuple(Discrete(5)), the env expects a vector of size 5.

import tensorflow as tf
import tensorflow_probability as tfp

# Sampling returns a point on the probability simplex, not a discrete index.
dist = tfp.distributions.RelaxedOneHotCategorical(temperature=1.0, logits=[[2., 3., 5.]])
dist.sample()

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.01832381, 0.009718 , 0.9719582 ]], dtype=float32)>

And with the default configuration, MPE interprets the action as follows:

agent.action.u[0] += action[0][1] - action[0][2] # this is the first axis
agent.action.u[1] += action[0][3] - action[0][4] # this is the second axis

So basically MPE is taking the difference of probabilities, which makes no sense to me.

Also, RelaxedOneHotCategorical adds noise everywhere, even though there is "Deterministic" in the name of the algorithm.

Am I missing something? Or is the implementation wrong?

jkterry1 commented 4 years ago

I mean, the default version of MPE is seriously jacked up; use the maintained/fixed version of it here: https://github.com/PettingZoo-Team/PettingZoo. I'll look into this more.

jkterry1 commented 4 years ago

Looked into it further with one of my colleagues. Basically, the original MPE environments are all kinds of screwed up. The ones in PettingZoo are actually usable. I think he'll reply below with some of the specifics.

benblack769 commented 4 years ago

So I have worked a bit with the original MADDPG and MPE implementations. The original MADDPG supports discrete environments by sampling from the logits, specifically using the Gumbel-Softmax (see the source here). So no, it was never deterministic, despite the name, at least for discrete environments. It is also worth noting that our experiments suggest that applying this algorithm to environments other than MPE does not work at all, which suggests that something is fundamentally broken in the details of the algorithm, and I suspect that these issues are related to the way it samples actions. In other words, there is something broken about MADDPG, and it is not necessarily RLlib's fault.
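
For context, a minimal sketch of the Gumbel-Softmax (relaxed one-hot) sampling being described; this is an illustration of the relaxation, not the original repository's exact code:

    import tensorflow as tf

    def gumbel_softmax_sample(logits, temperature=1.0):
        # Add Gumbel(0, 1) noise to the logits, then soften with a
        # temperature-scaled softmax. As temperature -> 0 the sample
        # approaches a one-hot vector; at higher temperatures it stays
        # a smooth probability vector.
        u = tf.random.uniform(tf.shape(logits), minval=1e-20, maxval=1.0)
        gumbel_noise = -tf.math.log(-tf.math.log(u))
        return tf.nn.softmax((logits + gumbel_noise) / temperature, axis=-1)

    # e.g. gumbel_softmax_sample(tf.constant([[2., 3., 5.]])) returns a
    # length-3 probability vector, never a hard index -- so the "action"
    # is stochastic even though the algorithm's name says deterministic.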

As for MPE, just look at the following two lines of code here

        # environment parameters
        self.discrete_action_space = True
        # if true, action is a number 0...N, otherwise action is a one-hot N-dimensional vector
        self.discrete_action_input = False

This sums up what you are dealing with: a discrete action space, but a continuous action input. Basically, it is internally inconsistent by default. So you might have to change some of these internal settings, or just use the implementation in PettingZoo, which I worked hard to fix and make sense of. Note that RLlib's nightly build supports PettingZoo environments.
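
As an aside, wiring a PettingZoo env into RLlib looks roughly like the sketch below; treat the import path and the env version suffix as assumptions to verify against your installed nightly and the docs:

    from ray.tune.registry import register_env
    # Import path as of recent nightlies -- verify against your RLlib version.
    from ray.rllib.env import PettingZooEnv
    from pettingzoo.mpe import simple_spread_v2  # version suffix may differ

    def env_creator(_config):
        # Wrap the PettingZoo AEC env so RLlib sees a MultiAgentEnv.
        return PettingZooEnv(simple_spread_v2.env())

    register_env("simple_spread", env_creator)
    # "simple_spread" can now be used as the "env" key in an RLlib config.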

Finally, as for the subtraction of probabilities, it actually does make sense if you interpret it as

motion_x = (prob_move_right) - (prob_move_left)

So the motion is actually the expected value of the motion given the probabilities.
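
As a hypothetical numeric check (assuming the five slots are ordered no-op, right, left, up, down; the exact ordering depends on the MPE version):

    # Hypothetical 5-dim "action" that is really a probability vector.
    action = [0.10, 0.50, 0.10, 0.20, 0.10]  # (no-op, right, left, up, down)

    # Expected motion along x: (+1) * P(right) + (-1) * P(left)
    motion_x = action[1] - action[2]  # 0.4
    # Expected motion along y: (+1) * P(up) + (-1) * P(down)
    motion_y = action[3] - action[4]  # 0.1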

raphaelavalos commented 4 years ago

Nice job on PettingZoo, I will definitely use it! I have been looking for a multi-agent env library for a while now :)

To give you some background, I am not using contrib/MADDPG but implemented my own version in PyTorch. It does learn on MPE (with self.discrete_action_space = False), but not as well as the plots in the paper, so I am trying to fix it.

In the original paper they mention the Gumbel-Softmax only for message selection, as the messages are discrete; however, in the implementation it is used for everything. So I agree with you @weepingwillowben, the problem comes from the sampling.

To me, the policy should directly output the action as in DDPG, with some noise added for exploration. I still have to implement the noise in my version and will update you if that works.
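
For illustration, a minimal sketch of that idea in PyTorch (hypothetical helper; additive Gaussian noise clipped to assumed action bounds of [-1, 1]):

    import torch

    def act(policy_net, obs, noise_std=0.1, low=-1.0, high=1.0):
        # Deterministic action from the policy, as in DDPG ...
        with torch.no_grad():
            action = policy_net(obs)
        # ... plus Gaussian exploration noise, clipped to the valid range.
        action = action + noise_std * torch.randn_like(action)
        return action.clamp(low, high)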

Semi-related question: do you have some code to record a video of the learned policies in an RLlib multi-agent environment? The gym monitor wrapper does not work in multi-agent settings :/

Thanks again for your time and responsiveness!

benblack769 commented 4 years ago

To answer your last question, I don't know about RLlib's MultiAgentEnv. The hard way would be to look at RLlib's rollout.py script, see how it runs trained agents in a loop, and generate the video yourself. I'm not sure about an easier way.
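
For what it's worth, a rough sketch of that "hard way" (assumes a restored RLlib trainer, an env whose render(mode="rgb_array") returns RGB frames, and imageio installed; names like "shared" are placeholders):

    import imageio

    def record_episode(trainer, env, out_path="episode.mp4", fps=20):
        frames, obs, done = [], env.reset(), {"__all__": False}
        while not done["__all__"]:
            # One action per agent; adjust policy_id if you use several policies.
            actions = {
                agent_id: trainer.compute_action(agent_obs, policy_id="shared")
                for agent_id, agent_obs in obs.items()
            }
            obs, rewards, done, infos = env.step(actions)
            frames.append(env.render(mode="rgb_array"))
        imageio.mimsave(out_path, frames, fps=fps)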

As for PettingZoo, we weren't planning on making a full-blown monitor wrapper. However, we are interested in making it easier to evaluate and visualize PettingZoo agents, and we would be open to a discussion on what that should look like.

raphaelavalos commented 4 years ago

@weepingwillowben I followed your advice and modified rollout.py. My PyTorch implementation seems to work well on MPE set to continuous action space, on the three envs I have tested so far (simple, simple_spread, and simple_push).

@justinkterry you mentioned that you tried to make MADDPG work in environments other than the MPE suite. Do you have an env with continuous actions to recommend for testing outside MPE?

jkterry1 commented 4 years ago

The only ones we ever got to work were the SISL environments in PettingZoo, though for the continuous ones in there we had to manually build a centralized critic for DDPG instead of using the MADDPG code. That stuff's cursed.

51616 commented 4 years ago

@justinkterry How do you use MADDPG in a discrete-action env? I have a custom env with discrete actions, but RLlib's MADDPG does not work out of the box. I want to use MADDPG as a baseline algorithm for my work. Any help would be appreciated.

jkterry1 commented 4 years ago

Look at my documentation changes and the discussion in PR #9640.

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!