google-deepmind / meltingpot

A suite of test scenarios for multi-agent reinforcement learning.

Result on collaborative_cooking_impassable_0 has very large variance #57

Closed: YetAnotherPolicy closed this issue 2 years ago

YetAnotherPolicy commented 2 years ago

Hi all, I ran experiments on collaborative_cooking_impassable_0 and the evaluation result is shown below. It seems to have a large variance. In your paper, the best result is 268 (shown below; see page 28 of the paper). What is the variance of that result? Is my result normal? Will you share the variance of the results in a future release?

[screenshots: evaluation result curve and the corresponding table from page 28 of the paper]
jagapiou commented 2 years ago

Eyeballing your graph and mine, the variance seems similar. Our exploiter got 269 (std 150, sem 5), our A3C got 211 (std 160, sem 3).

We're planning the next release now. I'll look into ways of releasing these sorts of stats in the future.
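
(For anyone wanting to reproduce these summary statistics, here is a minimal sketch, assuming you already have an array of per-episode focal returns from your own evaluation runs. The values in the array are placeholders, not Melting Pot data.)

```python
import numpy as np

# Placeholder per-episode focal returns from an evaluation run (made-up numbers).
returns = np.array([310.0, 120.0, 0.0, 450.0, 260.0])

mean = returns.mean()
std = returns.std(ddof=1)            # sample standard deviation
sem = std / np.sqrt(len(returns))    # standard error of the mean

print(f"mean={mean:.0f}  std={std:.0f}  sem={sem:.0f}  n={len(returns)}")
```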

YetAnotherPolicy commented 2 years ago

Thanks for the response. I am looking forward to the new release.

YetAnotherPolicy commented 2 years ago

Hi @jagapiou, I also noticed a similar pattern on collaborative_cooking_passable_1. The A3C result seems random as well. Why is training on collaborative_cooking hard? Is it due to the sparse reward? The variance is large and it seems hard to make the result converge to a better value.

[screenshots: A3C evaluation results on collaborative_cooking_passable_1]

jagapiou commented 2 years ago

For this scenario, the trained bot does a lot by itself (but uses a "one-pot" strategy). Ideally, your submitted population should learn to "help out" by running a two-pot strategy, "ferrying" tomatoes to near the pot, etc. (Watch videos of your agents to get an intuition of their strategy.)

However, on this scenario our exploiters are weak and don't learn such strategies. So I actually expect it will be possible to convincingly beat them here (I think a two-pot strategy should achieve at least 200). The reason our exploiters are weak on this scenario is that it's the same problem as in the non-exploiter case, but with N-1 players, which is still > 1.

One way the N-player case is hard is that there's a credit-assignment issue with shared rewards and partial observability. Consider: bot A drops food at the pass and agent B gets the shared reward, but agent B can't see bot A. So B may falsely conclude that the reward is random and that its actions have no impact, and may therefore learn to do nothing (a bit like "learned helplessness"). See the Melting Pot paper's discussion of "lazy agents" (end of section 7).
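
(A toy sketch of the pathology described above; this is not the Melting Pot substrate, just a made-up two-player step function in which the reward is shared and B's observation carries no trace of A's action.)

```python
import random

def toy_shared_reward_step(a_delivers: bool) -> dict:
    """One hypothetical timestep of a two-player shared-reward game."""
    reward = 1.0 if a_delivers else 0.0
    return {
        # A can see the pass and knows whether it delivered food.
        "A": {"obs": {"sees_pass": True, "delivered": a_delivers}, "reward": reward},
        # B is behind the wall: its observation carries no trace of A's action,
        # so the shared reward looks random from B's point of view.
        "B": {"obs": {"sees_pass": False}, "reward": reward},
    }

for _ in range(3):
    step = toy_shared_reward_step(a_delivers=random.random() < 0.5)
    print("B observes", step["B"]["obs"], "and receives reward", step["B"]["reward"])
```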

It's a fun problem!

YetAnotherPolicy commented 2 years ago

@jagapiou Hey, thanks for the informative reply. After playing with collaborative_cooking_impassable, I noticed the credit-assignment issue too. Agents need to identify which action actually triggered the final reward (the ready soup) after a delay of many timesteps. It is a great property for MARL research. It is cool and fun 😁