google-deepmind / meltingpot

A suite of test scenarios for multi-agent reinforcement learning.

Baseline Model Architecture #11

Closed kinalmehta closed 2 years ago

kinalmehta commented 2 years ago

Hi,

Thanks for this awesome repo and a great accompanying paper.

Following are the questions I couldn't find answers to in the paper:

  1. What observations are used for each agent? Only RGB, or are POSITION, ORIENTATION, and optionally READY_TO_SHOOT (and any other substrate-specific observations) also combined with RGB?
  2. What is the default model architecture? Is it a CNN for feature extraction followed by an LSTM/GRU? How are the observations combined if RGB is combined with other observations? And what exactly are the CNN architecture and the hyperparameters for the recurrent cells?
  3. In the pro-social variant, when training with the per-capita return, do you use a centralized-critic-style architecture or something like VDN or QMIX, and do you also use world observations for better convergence?

Thank you,
Kinal

jzleibo commented 2 years ago

Hi Kinal,

Thanks for your kind words! The answers to your questions are as follows.

What observations are used for each agent? Only RGB, or are POSITION, ORIENTATION, and optionally READY_TO_SHOOT (and any other substrate-specific observations) also combined with RGB?

The complete list of observations used by all agents is RGB, READY_TO_SHOOT, and INVENTORY. The implementation is here.

Many substrates also expose other observations like POSITION and ORIENTATION, but those are only intended for debugging. There is also a one-hot "layers" encoding (sprite identity by position) which can be selected instead of RGB; we didn't use it ourselves, but it is available as an alternative.

Not all substrates naturally expose INVENTORY; in those cases we add a "fake" observation of a zero tensor with the same structure, so that the same agent code can be used without modification for all substrates. The implementation is here.
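For concreteness, a minimal sketch of that padding idea (the function name and inventory size here are placeholders, not the actual Melting Pot implementation):

```python
import numpy as np

FAKE_INVENTORY_SIZE = 2  # assumption: pick the size your agent expects elsewhere


def pad_observation(obs):
  """Returns a copy of `obs` with an all-zeros INVENTORY if none is present."""
  obs = dict(obs)
  if "INVENTORY" not in obs:
    obs["INVENTORY"] = np.zeros((FAKE_INVENTORY_SIZE,), dtype=np.float64)
  return obs
```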

What is the default model architecture? Is it a CNN for feature extraction followed by an LSTM/GRU? How are the observations combined if RGB is combined with other observations? And what exactly are the CNN architecture and the hyperparameters for the recurrent cells?

Yes, it's essentially as you say. We used a CNN followed by an MLP, whose output then fed into an LSTM. The output of the LSTM was transformed through another MLP into the policy. The extra observations (READY_TO_SHOOT, INVENTORY), along with a one-hot representation of the previous step's action, were all concatenated with the MLP output to form the input to the LSTM. We used this basic network architecture for all the algorithms we tested.

There are some more details in part C of the ICML paper's appendix.
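As a rough sketch of that wiring in Sonnet/TF2 (all layer sizes and activations below are placeholders chosen for illustration, not the values we used; see appendix C, and the conv settings later in this thread):

```python
import sonnet as snt
import tensorflow as tf


class BaselineTorso(snt.Module):
  """Sketch of the described wiring: CNN -> MLP -> LSTM -> MLP -> policy."""

  def __init__(self, num_actions, name=None):
    super().__init__(name=name)
    self._num_actions = num_actions
    self._conv = snt.Sequential([
        snt.Conv2D(16, kernel_shape=8, stride=8, padding="VALID"),
        tf.nn.relu,
        snt.Conv2D(32, kernel_shape=4, stride=1, padding="VALID"),
        tf.nn.relu,
        snt.Flatten(),
    ])
    self._mlp = snt.nets.MLP([64, 64])
    self._core = snt.LSTM(128)
    self._policy_head = snt.nets.MLP([64, num_actions])

  def initial_state(self, batch_size):
    return self._core.initial_state(batch_size)

  def __call__(self, observation, prev_action, prev_state):
    rgb = tf.cast(observation["RGB"], tf.float32) / 255.0
    visual = self._mlp(self._conv(rgb))
    # READY_TO_SHOOT is a scalar per player, so give it a trailing feature axis.
    ready = tf.cast(observation["READY_TO_SHOOT"], tf.float32)[..., None]
    inventory = tf.cast(observation["INVENTORY"], tf.float32)
    prev_action_one_hot = tf.one_hot(prev_action, self._num_actions)
    core_input = tf.concat(
        [visual, ready, inventory, prev_action_one_hot], axis=-1)
    core_output, next_state = self._core(core_input, prev_state)
    logits = self._policy_head(core_output)
    return logits, next_state
```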

In the pro-social variant, when training with the per-capita return, do you use a centralized-critic-style architecture or something like VDN or QMIX, and do you also use world observations for better convergence?

The prosocial algorithms we tried were the most naive kind. On each timestep we simply replaced each individual's instantaneous reward with the sum of everyone's instantaneous rewards on that step.
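In other words, something as simple as this sketch (function name is just for illustration):

```python
import numpy as np


def share_rewards(rewards):
  """Replaces every player's reward with the sum over all players this step."""
  rewards = np.asarray(rewards, dtype=np.float64)
  return np.full_like(rewards, rewards.sum())
```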

We didn't try VDN or QMIX or any other "cooperative MARL" algorithms. I'm very curious what the result would be for these though. I think there's a nice paper waiting to be written by someone who wants to look into this.

We also did not try using the third person global observation (WORLD.RGB). I think it is justifiable under the "centralized training and decentralized execution" paradigm, though it depends a bit on your interpretation. If you are happy to assume free access to anything the simulator can produce during training, then why even stop at the global observation? You might as well use other debug signals too. Of course, none of these are available at test time, so the algorithm would need to be able to handle that.

Aside from using the third person global observation, another representation that could be used in centralized training is obtained by concatenating the individual observations from all players together. That's the representation we used in the centralized training phase for OPRE, which was the only algorithm we tested so far that was actually designed for the centralized training + decentralized execution regime. Regarding OPRE's prosocial variant, it's fair to describe it as using a centralized critic, though the OPRE algorithm was designed mainly for non-cooperative games, especially zero-sum ones, not cooperative MARL. All the other algorithms we tried were designed for the fully decentralized regime (decentralized even at training time).
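A minimal sketch of that concatenated representation (only using RGB here for brevity; which keys to include is up to you):

```python
import numpy as np


def centralized_input(per_player_obs):
  """Flattens and concatenates each player's own observation into one vector."""
  flat = [obs["RGB"].astype(np.float32).ravel() for obs in per_player_obs]
  return np.concatenate(flat, axis=0)
```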

I expect that performance improvements could be obtained by using more of the available global information during centralized training. But it's not completely obvious that it would work. It might cause agents to be more overfit to one another's behavior. If so, then they would generalize poorly when faced with unfamiliar co-players at test time. I think a thorough investigation of this issue, probably including algorithms like VDN and QMIX, would make for an interesting paper in its own right. Someone should write that paper!

kinalmehta commented 2 years ago

Thank you for the detailed response. I completely missed the supplementary material. A few more questions:

  1. Regarding the one-hot “layers” encoding. Could you please point me to where I can extract those from?
  2. I was looking into the shapes of the observations and noticed that in some environments INVENTORY has shape (3,) (e.g. Arena Running With Scissors In The Matrix, Pure Coordination In The Matrix) whereas in others it is (2,). And RGB in “Collaborative Cooking” has shape (40, 40, 3) whereas for others it is (88, 88, 3). So the network architecture is adapted to these changes, right?
  3. Any specific kind of padding used in the CNN layers?
  4. Any specific reason for concatenating the previous action too? Or is it just to give better context about the plan chosen by the policy in previous steps? I believe the hidden state should already contain that information.

jzleibo commented 2 years ago

Regarding the one-hot “layers” encoding. Could you please point me to where I can extract those from?

You should be able to get it by including "LAYER" in the substrate's config.individual_observation_names (e.g. here). Then, if you want your agent to use it for inference, you would also have to make sure to extract the key "LAYER" from the observation and pass it to the neural network. You could, for instance, replace "RGB" with "LAYER" throughout.

If you want a third person, global layers view then you can also use "WORLD.LAYER". It's analogous to "WORLD.RGB".
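A minimal sketch of that substitution on the agent side, assuming "LAYER" has already been added to config.individual_observation_names so it shows up in each timestep (the RGB normalization is just my own convention):

```python
def visual_input(observation, use_layers=True):
  """Selects the visual observation to feed to the conv net."""
  if use_layers:
    # LAYER is already a one-hot encoding, so just cast it to float.
    return observation["LAYER"].astype("float32")
  # RGB is uint8 in [0, 255].
  return observation["RGB"].astype("float32") / 255.0
```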

I was looking into the shapes of the observations and noticed that in some environments INVENTORY has shape (3,) (e.g. Arena Running With Scissors In The Matrix, Pure Coordination In The Matrix) whereas in others it is (2,). And RGB in “Collaborative Cooking” has shape (40, 40, 3) whereas for others it is (88, 88, 3). So the network architecture is adapted to these changes, right?

Yes, that's true. The sizes of the observations differ a bit for those substrates. Also, the two-player version of Running With Scissors in The Matrix has an observation of shape (40, 40, 3), like the collaborative cooking substrates. The inventories are size two or size three depending on whether the substrate in question has two or three different kinds of resources to collect.

The network architectures will need to be adapted to these changes, but that might not require changing any code. Most neural net specification libraries let you define a network layer just by specifying its number of output units; the number of parameters to create is then inferred automatically from the size of the input. This is how Sonnet's Linear module works, for instance.
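For example, this is all it takes in Sonnet; the layer is specified only by its output size, and the weight shapes are created from the first input it sees, so the same network definition handles both the (88, 88, 3) and (40, 40, 3) substrates after flattening:

```python
import sonnet as snt
import tensorflow as tf

layer = snt.Linear(64)
features = layer(tf.zeros([1, 40 * 40 * 3]))  # creates a [4800, 64] weight matrix
```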

Any specific kind of padding used in the CNN layers?

You can try using the special strides we did; I'm not sure whether they help or hinder though. Since we know the image is made by tiling 8x8 sprites, you can get away with choosing the kernel size and stride to be 8. We made this choice early on, after we started using DMLab2D, and never investigated its implications for Melting Pot substrates. In theory it should make things a bit faster. Though, anecdotally, one of my colleagues mentioned that he had tried reverting to a more normal stride value and found that doing so improved performance. So it's a bit unclear right now what the best thing to do is. We'll try to include specific suggested conv net parameters when we next release an update to the repo.

In the meantime, feel free to use the 2-layer conv net settings that we used. They are as follows:

```
SPRITE_SIZE = 8

padding = VALID
num_output_channels = (16, 32)
kernel_shapes = (SPRITE_SIZE, 4)
strides = (SPRITE_SIZE, 1)
```
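Plugged into a Sonnet conv stack, that would look roughly like this (the choice of ReLU between layers is my own assumption; it isn't specified above):

```python
import sonnet as snt
import tensorflow as tf

SPRITE_SIZE = 8

conv_net = snt.Sequential([
    snt.Conv2D(16, kernel_shape=SPRITE_SIZE, stride=SPRITE_SIZE, padding="VALID"),
    tf.nn.relu,
    snt.Conv2D(32, kernel_shape=4, stride=1, padding="VALID"),
    tf.nn.relu,
])
```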

Any specific reason for concatenating the previous action too? Or is it just to give better context about the plan chosen by the policy in previous steps? I believe the hidden state should already contain that information.

This is a pretty standard thing at DeepMind. I agree with you, though: it seems like redundant information and it's probably not necessary. We left it there because it's so common at DeepMind that removing it felt like an unnecessary departure from a common default. It very likely makes no difference one way or the other.