google-deepmind / meltingpot

A suite of test scenarios for multi-agent reinforcement learning.
Apache License 2.0

Training procedure in Collaborative Cooking. #39

Closed · grig-guz closed this issue 2 years ago

grig-guz commented 2 years ago

Hi, I was wondering if you used any intrinsic rewards/exploration methods besides an entropy bonus for training agents in collaborative cooking environments? I am using A3C and I'm having trouble getting any rewards in those environments due to sparsity.
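For context, by entropy bonus I just mean the usual entropy regularization term in the A3C loss. A minimal sketch of what I'm doing (framework choice and coefficients are only illustrative):

```python
import torch  # any framework would do; this is just a sketch

def a3c_loss(logits, actions, advantages, returns, values, entropy_coef=0.01):
    """Standard A3C loss with an entropy bonus for exploration (sketch)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(advantages.detach() * action_log_probs).mean()
    value_loss = 0.5 * (returns - values).pow(2).mean()
    entropy = -(probs * log_probs).sum(-1).mean()
    # The entropy bonus keeps the policy stochastic, but it doesn't by itself
    # solve the reward sparsity problem in these environments.
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy
```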

jzleibo commented 2 years ago

Hi!

In order to make the baseline results reported in the overview paper as simple and pure as possible, we didn't use any pseudorewards, intrinsic motivations, or exploration bonuses for any of the environments.

You are correct that the collaborative cooking environments are very difficult without the domain-specific pseudorewards. The baseline algorithms we tested all got scores of 0.0, exactly as you are reporting. You can see this number in the table in Appendix F of the paper. They get scores of 0.0 in self play, i.e. in training. They get higher scores in some of the scenarios because in those cases they are paired with a background population that is already skilled in the game.

I believe most research on collaborative cooking environments uses domain-specific pseudorewards to get initial learning off the ground. Our implementation also supports these. Check the original papers for the exact details, but I believe that by far the most important pseudoreward is the one for putting tomatoes in cooking pots. You can enable it by editing this line to set reward = 1.
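To illustrate the kind of edit meant, here is a rough sketch only; the component and argument names below are illustrative placeholders rather than the exact ones in the source, so please check the substrate config itself:

```python
# Sketch only, not the actual Melting Pot source: the substrate config exposes
# a pseudoreward for depositing a tomato into a cooking pot, defaulting to 0.
ADD_TOMATO_TO_POT_PSEUDOREWARD = 1.0  # change from 0.0 to 1.0 to enable shaping

cooking_pot_prefab = {
    "name": "cooking_pot",
    "components": [
        {
            "component": "IngredientReceiver",  # hypothetical component name
            "kwargs": {
                # Reward granted to the agent that adds a tomato to the pot.
                "reward": ADD_TOMATO_TO_POT_PSEUDOREWARD,
            },
        },
    ],
}
```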

You may also be interested to know that in the next release -- which is coming very soon! -- we will be adding to the suite several more collaborative cooking environments, which will more closely correspond to the maps used in other papers.

grig-guz commented 2 years ago

Thank you very much, that answers all my questions.

GoingMyWay commented 2 years ago

I also found that training on Collaborative Cooking is not easy with shared parameters across agent networks. The rewards stay at zero throughout training on the substrate.

@jzleibo By the way, are there any substrates in Melting Pot that are easier to train on? I found that MARL methods with experience replay need much more RAM than Atari games, since there are many agents. I think on-policy methods cost less RAM.
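For a rough sense of scale, here is a back-of-envelope sketch assuming 88x88x3 uint8 RGB observations per agent (an assumption; the actual observation spec may differ) and counting only the stored observations:

```python
import numpy as np

def replay_gb(num_agents, capacity, obs_shape=(88, 88, 3)):
    """Approximate replay-buffer size in GB, counting only observations."""
    obs_bytes = int(np.prod(obs_shape))  # uint8 -> 1 byte per element
    # Each transition stores the current and next observation for every agent.
    return num_agents * capacity * obs_bytes * 2 / 1e9

print(replay_gb(num_agents=1, capacity=1_000_000))  # ~46 GB for one agent
print(replay_gb(num_agents=8, capacity=1_000_000))  # ~372 GB for eight agents
```

So replay memory grows linearly with the number of agents, which matches what I'm seeing.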