google-deepmind / meltingpot

A suite of test scenarios for multi-agent reinforcement learning.

Training prosocial agents #43

Closed GoingMyWay closed 2 years ago

GoingMyWay commented 2 years ago

Dear Authors,

Prosocial agents perform well, but I found there is not much text in the paper on how they are trained. Your paper says:

Since the prosocial agents were explicitly trained to maximize the reward of other players, it is unsurprising that they would most benefit the background population.

How is the loss function defined here? I also found:

We also trained prosocial variants of all three algorithms, which directly optimized the per-capita return (rather than individual return), by sharing reward between players during training.

Is it similar to the training paradigm of MADDPG and QMIX?

Thanks in advance.

jagapiou commented 2 years ago

The only difference between the vanilla and prosocial variants was that the prosocial variants used np.mean(timestep.reward) instead of timestep.reward[player_index] as the reward during training (e.g. to train the critic).
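
A minimal sketch of that swap, assuming a dm_env-style TimeStep whose reward is a vector with one entry per player (the helper name is illustrative, not from the codebase):

import numpy as np

def training_reward(timestep, player_index, prosocial=False):
    """Reward used to train player `player_index` under either variant."""
    if prosocial:
        # Prosocial variant: every player optimizes the per-capita reward.
        return np.mean(timestep.reward)
    # Vanilla variant: each player optimizes only its own reward.
    return timestep.reward[player_index]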

We did this by applying a substrate_transform that adds coplayer rewards to the Substrate observations (via all_observations_wrapper). This allowed the training step to compute the mean reward from the observations in the generated trajectories.
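
Roughly, the transform can be pictured as the wrapper below. This is an illustrative stand-in, not the actual all_observations_wrapper code; it assumes observations are a list of per-player dicts, and the coplayer_rewards key is hypothetical:

import dm_env
import numpy as np

class AddCoplayerRewards(dm_env.Environment):
    """Copies the full reward vector into every player's observation dict."""

    def __init__(self, env):
        self._env = env

    def _add_rewards(self, timestep):
        if timestep.reward is None:  # reset() yields no reward.
            return timestep
        observation = [
            dict(obs, coplayer_rewards=np.asarray(timestep.reward))
            for obs in timestep.observation
        ]
        return timestep._replace(observation=observation)

    def reset(self):
        return self._add_rewards(self._env.reset())

    def step(self, action):
        return self._add_rewards(self._env.step(action))

    def observation_spec(self):
        # A complete wrapper would also extend the spec with the new key.
        return self._env.observation_spec()

    def action_spec(self):
        return self._env.action_spec()

The training step can then read coplayer_rewards out of the stored trajectories and take np.mean over it to get the per-capita reward.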

Note that none of our agents' policies were conditioned on reward, so we didn't use this information at behavior time (nor at test time).

This is important because the Scenario is a test of the trained policy, and we consider it "cheating" to modify the Scenario observations (e.g. by adding explicit coordination directives). The timestep reward at test time can therefore only be timestep.reward[player_index], and you shouldn't train the policies under test to expect anything else.

GoingMyWay commented 2 years ago


@jagapiou Dear John, thanks for the clarification. In my code, I use timestep.reward from the substrate for MARL training; I think this is what you described:

reward = np.sum(timestep.reward)/n_agents
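# Equivalent to np.mean(timestep.reward) when n_agents == len(timestep.reward).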

Thank you!