Closed grig-guz closed 2 years ago
Hi!
In order to keep the baseline results reported in the overview paper as simple and pure as possible, we didn't use any pseudorewards, intrinsic motivation, or exploration bonuses in any of the environments.
You are correct that the collaborative cooking environments are very difficult without the domain-specific pseudorewards. The baseline algorithms we tested all scored 0.0, exactly as you report. You can see this number in the table in Appendix F of the paper. They score 0.0 in self-play, i.e. during training. They score higher in some of the scenarios because there they are paired with a background population that is already skilled at the game.
I believe most research on collaborative cooking environments uses domain-specific pseudorewards to get initial learning off the ground. Our implementation also supports these. Check the original papers for the exact details, but I believe that by far the most important pseudoreward is the one for putting tomatoes in cooking pots. You can enable it by editing this line to set `reward = 1`.
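In code, the enabling step might look like the sketch below. The dictionary layout and key names here are purely illustrative (not the actual Melting Pot config structure); the real change is the one-line `reward = 1` edit in the substrate source linked above.

```python
# Hypothetical sketch: turning on a shaping pseudoreward in a
# substrate config. Key names are illustrative only; consult the
# Melting Pot source for the actual location of the reward setting.
def enable_tomato_pseudoreward(config):
    """Set the pseudoreward for putting a tomato in a cooking pot."""
    # Equivalent in spirit to changing `reward = 0` to `reward = 1`
    # on the line referenced in the comment above.
    config["pseudorewards"]["add_tomato_to_pot"] = 1
    return config

config = {"pseudorewards": {"add_tomato_to_pot": 0}}
config = enable_tomato_pseudoreward(config)
```

With shaping rewards like this enabled, agents get a dense learning signal for the intermediate cooking steps instead of only the sparse reward for delivering a finished dish.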
You may also be interested to know that in the next release -- which is coming very soon! -- we will be adding several more collaborative cooking environments to the suite, which will correspond more closely to the maps used in other papers.
Thank you very much, that answers all my questions.
I also found that training Collaborative Cooking is hard even with parameters shared across agent networks: the rewards stay at zero throughout training on the substrate.
@jzleibo By the way, are there any substrates in Melting Pot that are easier to train on? I found that MARL methods that use experience replay need much more RAM than Atari training does, since there are many agents. I think on-policy methods would cost less RAM.
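A back-of-envelope calculation shows why replay memory grows so fast with agent count. The numbers below are purely illustrative assumptions (88x88 RGB uint8 observations, a 1M-transition buffer per agent), not Melting Pot's actual sizes:

```python
# Rough replay-buffer memory estimate for multi-agent training.
# Assumed (illustrative) numbers: 88x88x3 uint8 observations and a
# 1M-transition buffer per agent; actions/rewards are negligible.
def replay_gib(num_agents, capacity=1_000_000, obs_bytes=88 * 88 * 3):
    """Return approximate buffer size in GiB, one buffer per agent."""
    return num_agents * capacity * obs_bytes / 2**30

single = replay_gib(num_agents=1)  # Atari-like single-agent baseline
multi = replay_gib(num_agents=8)   # hypothetical 8-player substrate
```

Under these assumptions a single-agent buffer is already over 20 GiB, and memory scales linearly with the number of agents, which matches the observation that on-policy methods (which keep no replay buffer) are much lighter on RAM.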
Hi, I was wondering whether you used any intrinsic rewards or exploration methods besides an entropy bonus when training agents in the collaborative cooking environments? I am using A3C and, due to the reward sparsity, I'm having trouble getting any reward in these environments.
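For reference, the entropy bonus mentioned here is the standard regularizer added to A3C-style policy losses. The sketch below is a minimal illustration (function name and coefficient are mine, not from any particular codebase):

```python
import numpy as np

# Minimal sketch of an A3C-style entropy bonus for a softmax policy
# over discrete actions. The coefficient beta is illustrative.
def entropy_bonus(logits, beta=0.01):
    """Return beta * H(pi), the term added to the policy objective."""
    z = logits - logits.max()                 # numerical stability
    probs = np.exp(z) / np.exp(z).sum()       # softmax policy
    entropy = -(probs * np.log(probs)).sum()  # H(pi)
    return beta * entropy
```

A uniform policy over n actions has entropy ln(n), the maximum, so the bonus pushes the policy toward exploration early on and fades as the policy sharpens; on its own, though, it does nothing to densify a sparse environment reward, which is why the pseudorewards above matter here.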