Closed kinalmehta closed 2 years ago
At learning-time the asynchronous actor-critic baseline agent ("A3C/IMPALA") processes trajectories of its own observations/actions only, so it doesn't need to know which player slot it was in when that data was generated.
Only OPRE saw co-player observations during learning (but not during behavior), and it received a batch of one-hot vectors telling it which slot was its own perspective.
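To illustrate the point above, here is a minimal sketch (hypothetical code, not from the Melting Pot codebase): each agent's learner consumes only the stream of its own observations/actions, so the player slot it happened to occupy never appears in the learning data. The `collect_episode` helper and the toy agents are assumptions for illustration.

```python
import numpy as np

NUM_PLAYERS = 2

def collect_episode(agents, player_slots):
    """Toy rollout: record each agent's OWN (obs, action) stream only."""
    trajectories = {name: [] for name in agents}
    for step in range(3):  # toy 3-step episode
        for name, slot in player_slots.items():
            obs = np.zeros(4)           # placeholder observation
            action = agents[name](obs)  # agent picks an action
            trajectories[name].append((obs, action))
    return trajectories

agents = {"agent_1": lambda obs: 0, "agent_2": lambda obs: 1}

# The slot assignment changes from episode to episode...
traj_a = collect_episode(agents, {"agent_1": "player_1", "agent_2": "player_2"})
traj_b = collect_episode(agents, {"agent_1": "player_2", "agent_2": "player_1"})

# ...but each learner only ever sees its own stream, so both assignments
# yield data of exactly the same form for the A3C/IMPALA update.
assert len(traj_a["agent_1"]) == len(traj_b["agent_1"]) == 3

# OPRE additionally received a one-hot vector marking its own slot,
# e.g. for an agent playing in slot 0:
slot_one_hot = np.eye(NUM_PLAYERS)[0]
```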
Thank you for your answer.
So if I understand correctly: even if the mapping at experience-collection time is {"agent_1": "player_1", "agent_2": "player_2"}, the policy gradients can be calculated with a different mapping, say {"agent_1": "player_2", "agent_2": "player_1"}?
(Note: here agent_* refers to the network being used to select an action, or whose parameters are being updated using PG.)
Is there any form of parameter-sharing being adopted? E.g., for CNN feature extraction.
I'm not sure what you mean about using a different mapping in the policy-gradient update. The mapping of agent to player slot is a behavior-time detail that is always irrelevant to the A3C/IMPALA agent:
We did no parameter-sharing between agents (or data-sharing), so they were only coupled by correlated experience arising from being in the same episodes (from different perspectives).
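A minimal sketch of the setup described above (hypothetical code, assuming simple per-agent parameter vectors): each agent holds entirely separate parameters, so updating one agent can never touch another's weights. Their only coupling is that their experience comes from the same episodes.

```python
import numpy as np

rng = np.random.default_rng(0)

# One separate parameter vector per agent: no shared CNN trunk,
# no shared replay data.
params = {name: rng.normal(size=4) for name in ("agent_1", "agent_2")}

def update(theta, grad, lr=0.1):
    """Plain gradient step on one agent's own parameters."""
    return theta - lr * grad

# Updating agent_1 leaves agent_2's parameters untouched.
before = params["agent_2"].copy()
params["agent_1"] = update(params["agent_1"], np.ones(4))
assert np.array_equal(params["agent_2"], before)
```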
Hi,
It is mentioned in the MeltingPot paper that "Every agent participated in every training episode, with each agent playing as exactly one player (selected randomly on each episode)." (Section 7, Experiments). But in A3C/IMPALA, experience collection is decoupled from the training loop. So how exactly is the information about which agent generated the experience for which player shared between the experience-collection loop and the training loop? E.g., let the agent-player mapping for a 2-player environment be as follows:
As the experience is stored in a replay buffer, and multiple episodes can be sampled from the buffer together (depending on batch size), the agent-player mapping can vary for each example in the batch. How exactly is this being handled?
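The scenario in question can be sketched like this (hypothetical code, one possible bookkeeping scheme, not necessarily what Melting Pot does): each stored trajectory carries a tag for the mapping it was collected under, so a sampled batch can mix trajectories from different mappings without ambiguity.

```python
import random
from collections import deque

# Replay buffer whose entries come from episodes with different
# agent-to-player mappings; each entry records its own mapping.
buffer = deque(maxlen=100)
buffer.append({"agent": "agent_1", "player": "player_1", "traj": [0, 1, 2]})
buffer.append({"agent": "agent_1", "player": "player_2", "traj": [3, 4, 5]})

batch = random.sample(list(buffer), k=2)
# Each sampled element carries its own mapping, so mixed mappings
# within one batch pose no problem.
assert all({"agent", "player", "traj"} <= set(item) for item in batch)
```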
Possible options:
Thanks Kinal