It would be great to get some feedback on this one - specifically:
- I've trained the model successfully on a trimmed down version of commons harvest open, but don't have the compute to train 10^9 timesteps as per the paper, with 16 agents. Do you know roughly the compute resources you used to successfully train this? I'm trying to determine if there are efficiencies that can be added, or if this needs to use significant distributed compute.
- The previous example had parameter sharing (as does this one) across agents during training. Did you do this in training for the paper?
This example is otherwise hopefully an improvement on `self_play_train.py`.
Regarding the training, you can see this link: https://github.com/deepmind/meltingpot/issues/15
Thanks @YetAnotherPolicy! It looks like I'm getting similar performance to the results there, but I'll see if there is any tuning that can be done in RLlib.
My pleasure.
Hello @alan-cooney,
I have a question, is it okay that all the agents share the same policy? Or should they be independent learners?
Thanks in advance! :)
Hi @ManuelRios18 - The original paper doesn't specify which approach was used in training (as far as I can tell), but it's pretty straightforward to switch to fully independent policies in RLlib. As I understand it, you just add another named policy to `policies` (e.g. you could name them `player_1`, `player_2` and so on) and then update the `policy_mapping_fn` to choose a different policy depending on the `agent_id`.
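For illustration, a minimal sketch of that change using RLlib's config-dict multi-agent API - the agent IDs (`player_0`, `player_1`, ...), the registered environment name, and the `obs_space`/`act_space`/`num_players` placeholders are assumptions here and will depend on the Ray version and the substrate wrapper used:

```python
# Hypothetical sketch: one independent policy per agent (no parameter sharing).
from ray import tune

# Placeholders - in practice these come from the wrapped Melting Pot substrate.
num_players = 16
obs_space = ...   # per-player observation space reported by the env wrapper
act_space = ...   # per-player action space reported by the env wrapper

config = {
    "env": "meltingpot",  # placeholder for whatever env name the example registers
    "multiagent": {
        # One named policy per agent; None means "use the trainer's default policy class".
        "policies": {
            f"player_{i}": (None, obs_space, act_space, {}) for i in range(num_players)
        },
        # Map each agent to its own policy so no parameters are shared.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: agent_id,
    },
}

tune.run("A3C", config=config)
```

The parameter-sharing setup in the current example corresponds to keeping a single entry in `policies` and mapping every `agent_id` to it.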
Hi @alan-cooney!
I agree with you - that is how I would implement the independent learning scheme with RLlib.
I am not sure what they did in the Melting Pot paper, but in the work where they introduce the commons harvest problem they state:
> To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation
On the other hand, they also refer to independent learning in the sequential social dilemmas paper:
> We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network
It would be nice if one of the authors could help us clarify that!
I could make a pull request to add independent learners if needed.
We haven't had a chance to look at the pull request here yet. But I can answer the question about the training protocol.
We never used any parameter sharing between agents in any of our papers on sequential social dilemmas or related topics. They were always independent agents with their own neural networks, trained from their own observations.
The same is true for the Melting Pot paper. There was no parameter sharing in the baseline results reported there.
The "rules" of Melting Pot are intentionally agnostic on this though. If you want to share parameters between agents then that's perfectly fine.
Thanks for confirming @jzleibo - I think it makes sense for this example to try and closely match the paper, so I'm going to move to draft until that's added in.
I do not think the RMSProp optimiser is used. From reading the code, it appears that A3C uses the default optimiser, which is Adam, and it is not configurable. See `ray.rllib.policy.tf_policy.TFPolicy.optimizer()`. I think this will have to be deleted from the example, unfortunately.
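For context, `TFPolicy.optimizer()` is the hook that returns the training optimiser. A purely hypothetical sketch of what overriding it for RMSProp could look like is below - this is not part of the example, and since the stock A3C policy class is generated dynamically in RLlib, dropping such a subclass into the trainer is not straightforward (hence removing RMSProp from the example):

```python
# Hypothetical sketch: replace the default Adam optimiser by overriding
# TFPolicy.optimizer(). Shown only to illustrate the hook, not how the example works.
import tensorflow as tf
from ray.rllib.policy.tf_policy import TFPolicy


class RMSPropTFPolicy(TFPolicy):
    """A TFPolicy variant that trains with RMSProp instead of the default Adam."""

    def optimizer(self):
        return tf.compat.v1.train.RMSPropOptimizer(
            learning_rate=self.config["lr"],  # "lr" is RLlib's standard learning-rate key
            decay=0.99,                       # illustrative values, not taken from the paper
            epsilon=1e-5,
        )
```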
This demonstrates training on commons_harvest_open with the hyperparameters from the original Melting Pot paper.
Note this depends on #60
Co-authored-by: @Muff2n