The only difference between the vanilla and prosocial variants was that the prosocial variants used `np.mean(timestep.reward)` instead of `timestep.reward[player_index]` as the reward during training (e.g. to train the critic).
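As a minimal sketch of what this substitution looks like (the `training_reward` helper below is hypothetical, not part of the Melting Pot API):

```python
import numpy as np

def training_reward(timestep, player_index, prosocial=False):
  # Vanilla agents learn from their own reward; prosocial agents learn
  # from the mean reward over all players in the substrate.
  if prosocial:
    return np.mean(timestep.reward)
  return timestep.reward[player_index]
```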
We did this by modifying the Substrate observations: a `substrate_transform` adds coplayer rewards (via `all_observations_wrapper`), which allows the training step to calculate the mean from the observations in the generated trajectories.
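For example, something along these lines (the import path, constructor argument, and the `'global'`/`'rewards'` observation keys are assumptions based on the description above, not a verified Melting Pot API):

```python
import numpy as np
from meltingpot.python.utils.substrates.wrappers import all_observations_wrapper

def build_training_env(substrate):
  # Share every player's reward through the observations so it is
  # available in the stored trajectories (argument name is an assumption).
  return all_observations_wrapper.Wrapper(substrate, share_rewards=True)

def prosocial_reward(observation):
  # Mean over coplayer rewards carried in the shared observation
  # (the 'global'/'rewards' key names are assumptions).
  return np.mean(observation['global']['rewards'])
```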
Note that none of our agents' policies were conditioned on reward, so we didn't use this information at behavior time (nor at test time).
This is important because the Scenario is a test of the trained policy, so we consider it "cheating" to modify the Scenario observations (e.g. by adding explicit coordination directives). The timestep reward at test time can therefore only be `timestep.reward[player_index]`, and you shouldn't train the policies under test to expect anything else.
@jagapiou Dear John, thanks for the clarification. In my code, I use `timestep.reward` from the substrate for MARL training. I think that is what you mentioned:

```python
reward = np.sum(timestep.reward) / n_agents  # equivalent to np.mean(timestep.reward)
```

Thank you!
Dear Authors,
Prosocial agents perform well, but I found there is not much text on training prosocial agents in the paper. In your paper,
Here, how is the loss function defined? I also found
Is it similar to the training paradigm of MADDPG and QMIX?
Thanks in advance.