google-deepmind / meltingpot

A suite of test scenarios for multi-agent reinforcement learning.

Training Details for A3C #56

Closed: kinalmehta closed this issue 2 years ago

kinalmehta commented 2 years ago

Hi,

It is mentioned in the MeltingPot paper that "Every agent participated in every training episode, with each agent playing as exactly one player (selected randomly on each episode)" (Section 7, Experiments). But in A3C/IMPALA, experience collection is decoupled from the training loop. So how exactly is the information about which agent generated the experience for which player shared between the experience-collection loop and the training loop? E.g., let the agent-player mapping for a 2-player environment be as follows:

Since the experience is stored in a replay buffer, and multiple episodes can be sampled from the buffer together (depending on the batch size), the agent-player mapping can vary across the examples in a batch. How exactly is this handled?

Possible options:

Thanks, Kinal

jagapiou commented 2 years ago

At learning time the asynchronous actor-critic baseline agent ("A3C/IMPALA") processes trajectories of its own observations and actions only, so it doesn't need to know which player slot it was in when that data was generated.
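To make this concrete, here is a minimal sketch (not the actual Melting Pot or baseline training code, just an assumed setup) of how an actor process might randomly assign agents to player slots each episode and route each slot's trajectory to that agent's own queue. The names `env`, `agents`, `agent_queues` and `collect_episode` are hypothetical placeholders:

```python
import random

def run_actor(env, agents, agent_queues, collect_episode, num_episodes):
    """Routes each player slot's trajectory to the agent that played it.

    `env`, `agents`, `agent_queues` and `collect_episode` are hypothetical
    placeholders: `collect_episode(env, agents, assignment)` is assumed to
    return {player_slot: trajectory}, where each trajectory holds only that
    player's own observations, actions and rewards.
    """
    agent_ids = list(agents)
    for _ in range(num_episodes):
        # Randomly assign each agent to exactly one player slot this episode.
        slots = list(range(env.num_players))
        random.shuffle(slots)
        assignment = dict(zip(agent_ids, slots))  # agent_id -> player slot

        trajectories = collect_episode(env, agents, assignment)

        for agent_id, slot in assignment.items():
            # Each learner consumes only its own agent's trajectories, so the
            # behavior-time slot index never needs to be stored or replayed.
            agent_queues[agent_id].put(trajectories[slot])
```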

Only OPRE saw co-player observations during learning (but not during behavior), and it received a batch of one-hot vectors telling it which slot was its own perspective.
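As an illustration only (the field names and shapes are assumptions, not the actual OPRE implementation), the learning-time batch for one episode might look like this, with co-player observations plus a one-hot vector marking the agent's own slot:

```python
import numpy as np

def make_learning_batch(all_player_obs, own_slot, num_players):
    """all_player_obs: array of shape [T, num_players, obs_dim] for one episode."""
    one_hot = np.zeros(num_players, dtype=np.float32)
    one_hot[own_slot] = 1.0
    return {
        "own_obs": all_player_obs[:, own_slot],  # what the policy saw at behavior time
        "coplayer_obs": all_player_obs,          # available at learning time only
        "own_slot_one_hot": one_hot,             # tells the learner which perspective is its own
    }
```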

kinalmehta commented 2 years ago

Thank you for your answer.

So if I understand correctly, even if the mapping is {"agent_1": "player_1", "agent_2": "player_2"} at experience-collection time, the policy gradients can be calculated with a different mapping, say {"agent_1": "player_2", "agent_2": "player_1"}? (Note: here agent_* refers to the network being used to select an action, or whose parameters are being updated via the policy gradient.)

Is there any form of parameter-sharing being adopted? E.g., for CNN feature extraction.

jagapiou commented 2 years ago

I'm not sure what you mean about using a different mapping in the policy-gradient update. The mapping of agent to player slot is a behavior-time detail that is always irrelevant to the A3C/IMPALA agent:

  1. Policies are conditioned only on their own observations. They have to infer the existence of co-players from their observations. (Decentralized execution).
  2. In the A3C/IMPALA agent, the critic is conditioned only on the agent's own behavior-time observations, so the exact behavior-time slot is irrelevant. It must also infer the existence of co-players. (Decentralized learning). See the sketch below.
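A minimal sketch of such a decentralized actor-critic, assuming a toy NumPy MLP (this is not the baseline's actual network, only an illustration that both the policy and value heads condition on the agent's own observation alone):

```python
import numpy as np

class DecentralizedActorCritic:
    """Toy actor-critic: both heads see only the agent's own observation."""

    def __init__(self, obs_dim, num_actions, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_torso = rng.normal(scale=0.1, size=(obs_dim, hidden))
        self.w_policy = rng.normal(scale=0.1, size=(hidden, num_actions))
        self.w_value = rng.normal(scale=0.1, size=(hidden, 1))

    def __call__(self, own_obs):
        # No co-player observations or player-slot index are used anywhere.
        h = np.tanh(own_obs @ self.w_torso)
        policy_logits = h @ self.w_policy        # actor head
        value = (h @ self.w_value).squeeze(-1)   # critic head, also decentralized
        return policy_logits, value
```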

We did no parameter-sharing between agents (and no data-sharing), so they were coupled only by the correlated experience arising from being in the same episodes (from different perspectives).
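As a small illustration of the no-sharing setup (reusing the hypothetical `DecentralizedActorCritic` class sketched above; the population size and dimensions are arbitrary), each agent would simply get its own independently initialized parameters and, in practice, its own optimizer state:

```python
# Illustrative only: one independently seeded network per agent in the
# population, i.e. no shared CNN torso or any other shared parameters.
population = {
    f"agent_{i}": DecentralizedActorCritic(obs_dim=128, num_actions=8, seed=i)
    for i in range(8)  # population size chosen arbitrarily for the example
}
```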