openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"
https://arxiv.org/pdf/1706.02275.pdf
MIT License

It seems that the training is decentralized? #7

Closed: pengzhenghao closed this issue 6 years ago

pengzhenghao commented 6 years ago

I have looked through train.py and found that you provide each agent with its own trainer:

```python
def get_trainers(env, num_adversaries, obs_shape_n, arglist):
    trainers = []
    model = mlp_model
    trainer = MADDPGAgentTrainer
    # one trainer (and hence one model) per adversary agent
    for i in range(num_adversaries):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.adv_policy=='ddpg')))
    # one trainer per "good" agent
    for i in range(num_adversaries, env.n):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.good_policy=='ddpg')))
    return trainers
```
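For context, train.py then uses this function roughly as follows (a sketch, the exact surrounding code may differ):

```python
# Rough sketch of how get_trainers is invoked in train.py (not an exact quote):
obs_shape_n = [env.observation_space[i].shape for i in range(env.n)]
num_adversaries = min(env.n, arglist.num_adversaries)
trainers = get_trainers(env, num_adversaries, obs_shape_n, arglist)
```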

Maybe I have a wrong understanding, but centralized training in my understanding means using a single, shared model to learn the Q-function. So I can't understand why each agent is assigned its own trainer (including its own model inside the trainer), since you end up training many models rather than only one.

I also found that even though you use reuse=True when setting up tf.variable_scope, each agent's model has variables with names like "agent_0/fully_connected/weights". That means the weights and biases of the models are not shared: agent_0 has its own model, agent_1 has its own model, and so on.
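Here is a minimal sketch (not the repo's code, just an illustration in TF 1.x) of why scoping each agent's network under its own name produces separate parameters:

```python
import tensorflow as tf  # TF 1.x, as used in this repo

# Illustration only: each agent builds its network under its own variable
# scope, so the variable names, and therefore the parameters, are distinct.
for i in range(2):
    with tf.variable_scope("agent_%d" % i):
        with tf.variable_scope("fully_connected"):
            tf.get_variable("weights", shape=[64, 64])

print([v.name for v in tf.global_variables()])
# e.g. ['agent_0/fully_connected/weights:0', 'agent_1/fully_connected/weights:0']
```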

So how can you say that the training of this multi-agent system is centralized?

Looking forward to your reply! Thanks!

ryan-lowe commented 6 years ago

Hi, It is true that each agent learns its own policy and Q-function. The training is centralized in the sense that the inputs to each Q function depend on the actions and observations of all the agents. Usually, for the training to be considered fully 'decentralized', each agent's policy and value are only functions of that agent's observation and actions. This is consistent with other papers in the literature (see e.g. the similar work on COMA, https://arxiv.org/pdf/1705.08926.pdf)