It would be great to get some feedback on this one - specifically:
- I've trained the model successfully on a trimmed down version of commons harvest open, but don't have the compute to train 10^9 timesteps as per the paper, with 16 agents. Do you know roughly the compute resources you used to successfully train this? I'm trying to determine if there are efficiencies that can be added, or if this needs to use significant distributed compute.
- The previous example had parameter sharing (as does this one) across agents during training. Did you do this in training for the paper?
This example is otherwise hopefully an improvement on `self_play_train.py`.
Regarding the training, you can see this link: https://github.com/deepmind/meltingpot/issues/15
Thanks @YetAnotherPolicy! It looks like I'm getting similar performance to the results there, but I'll see if there is any tuning that can be done in RLlib.
My pleasure.
Hello @alan-cooney,
I have a question, is it okay that all the agents share the same policy? Or should they be independent learners?
Thanks in advance! :)
Hi @ManuelRios18 - The original paper doesn't specify which approach was used in training (as far as I can tell), but it's pretty straightforward to switch to fully independent policies in RLlib. As I understand it, you just add another named policy to `policies` (e.g. you could name them `player_1`, `player_2` and so on) and then update the `policy_mapping_fn` to choose a different policy depending on the `agent_id`.
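For illustration, a minimal sketch of that change using RLlib's config-dict multi-agent API - the agent IDs (`player_0`, `player_1`, ...), the registered environment name, and the `obs_space`/`act_space`/`num_players` placeholders are assumptions here and will depend on the Ray version and the substrate wrapper used:

```python
# Hypothetical sketch: one independent policy per agent (no parameter sharing).
from ray import tune

# Placeholders - in practice these come from the wrapped Melting Pot substrate.
num_players = 16
obs_space = ...   # per-player observation space reported by the env wrapper
act_space = ...   # per-player action space reported by the env wrapper

config = {
    "env": "meltingpot",  # placeholder for whatever env name the example registers
    "multiagent": {
        # One named policy per agent; None means "use the trainer's default policy class".
        "policies": {
            f"player_{i}": (None, obs_space, act_space, {}) for i in range(num_players)
        },
        # Map each agent to its own policy so no parameters are shared.
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: agent_id,
    },
}

tune.run("A3C", config=config)
```

The parameter-sharing setup in the current example corresponds to keeping a single entry in `policies` and mapping every `agent_id` to it.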
Hi @alan-cooney!
I agree with you - that is how I would implement the independent learning scheme with RLlib.
I am not sure what they did in the Melting Pot paper, but in the work where they introduce the commons harvest problem they state:
> To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation
On the other hand, they also refer to independent learning in the sequential social dilemmas paper:
> We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network
It would be nice if one of the authors could help us clarify that!
I could make a pull request to add independent learners if needed.
We haven't had a chance to look at the pull request here yet. But I can answer the question about the training protocol.
We never used any parameter sharing between agents in any of our papers on sequential social dilemmas or related topics. They were always independent agents with their own neural networks, trained from their own observations.
The same is true for the Melting Pot paper. There was no parameter sharing in the baseline results reported there.
The "rules" of Melting Pot are intentionally agnostic on this though. If you want to share parameters between agents then that's perfectly fine.
Thanks for confirming @jzleibo - I think it makes sense for this example to try and closely match the paper, so I'm going to move to draft until that's added in.
I do not think the RMSProp optimiser is used. From reading the code, it appears that A3C uses the default optimiser, which is Adam, and it is not configurable. See `ray.rllib.policy.tf_policy.TFPolicy.optimizer()`. I think this will have to be deleted from the example, unfortunately.
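For context, `TFPolicy.optimizer()` is the hook that returns the training optimiser. A purely hypothetical sketch of what overriding it for RMSProp could look like is below - this is not part of the example, and since the stock A3C policy class is generated dynamically in RLlib, dropping such a subclass into the trainer is not straightforward (hence removing RMSProp from the example):

```python
# Hypothetical sketch: replace the default Adam optimiser by overriding
# TFPolicy.optimizer(). Shown only to illustrate the hook, not how the example works.
import tensorflow as tf
from ray.rllib.policy.tf_policy import TFPolicy


class RMSPropTFPolicy(TFPolicy):
    """A TFPolicy variant that trains with RMSProp instead of the default Adam."""

    def optimizer(self):
        return tf.compat.v1.train.RMSPropOptimizer(
            learning_rate=self.config["lr"],  # "lr" is RLlib's standard learning-rate key
            decay=0.99,                       # illustrative values, not taken from the paper
            epsilon=1e-5,
        )
```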
This demonstrates training on commons_harvest_open with the hyperparameters from the original Melting Pot paper.
Note this depends on #60
Co-authored-by: @Muff2n