How to support multi-agent reinforcement learning

youkaichao commented 4 years ago

[ ] I have marked all applicable categories:
- [ ] exception-raising bug
- [ ] RL algorithm bug
- [ ] documentation request (i.e. "X is missing from the documentation.")
- [x] new feature request
[x] I have visited the source website, and in particular read the known issues
[x] I have searched through the issue tracker and issue categories for duplicates

[ ] I have mentioned version numbers, operating system and environment, where applicable:

import tianshou, torch, sys
print(tianshou.__version__, torch.__version__, sys.version, sys.platform)

youkaichao commented 4 years ago

This issue can be used to track the design of multi-agent reinforcement learning implementation.

youkaichao commented 4 years ago

After some pilot study, I find there are three paradigms of multi-agent reinforcement learning:

simultaneous move, at each timestep, all the agent take their actions (example: moba games)
cyclic move, players take action in turn (example: Go game)
conditional move, at each timestep, the environment conditionally selects an agent to take action. (example: Pig Game)

The problem is how to transform these paradigms to the following standard RL procedure:

action = policy(state)
next_state, reward = env.step(action)

For simultaneous move, the solution is simple: we can just add a num_agent dimension to state, action, reward. Nothing else is going to change.

For 2 & 3 (cyclic move & conditional move), an elegant solution is:

action = policy(state, agent_id)
next_state, next_agent_id, reward = env.step(action)

By constructing new state state_ma = {'state':state, 'agent_id':agent_id}, essentially we can go back to the standard case:

action = policy(state_ma)
next_state_ma, reward = env.step(action)

Just be careful that reward here can be rewards for all the players (the action of one player may affect all players).

Usually, the legal action set varies with state, so it is more convenient to denote state_ma = {'state':state, 'legal_actions':legal_actions, 'agent_id':agent_id}.

duburcqa commented 4 years ago

I don't think there is a need for multi-agent reinforcement learning at short term. To me the priority is to improve the current functionality. There is major flaws in the current implementation, especially regarding the efficiency of distributed sampling, which are way more critical to handle than adding new features. The core has be to robust before building on top of it.

Yet, of course, it is still interesting to discuss the implementation of future features!

Trinkle23897 commented 4 years ago

I don't think there is a need for multi-agent reinforcement learning at short term. To me the priority is to improve the current functionality. There is major flaws in the current implementation, especially regarding the efficiency of distributed sampling, which are way more critical to handle than adding new features. The core has be to robust before building on top of it.

Yet, of course, it is still interesting to discuss the implementation of future features!

This feature does not change any of the current code and is also compatible with 2d-buffer. I think it is independent with what you said.

duburcqa commented 4 years ago

I think it is independent with what you said.

Of course it is! Work force being limited, someone working on a specific feature necessarily affects / slow-down the development of the others. That's the point of prioritizing development of some features wrt to others, it is often not because of interdependencies, but rather because of limited work force.

But obviously, in the setup of an open-source project, anyone is free to work on the features they want to.

youkaichao commented 4 years ago

I gave an example of playing Tic Tac Toe in the test case, without modifying the core code.

What is the next step seems unclear. I do not know how people in MARL typically train their model, especially how to sample experience.Some possible ideas may be:

each player only learns on his own experience
each player leans on experience of all players

Which one is commonly used? Or both? Or there are other paradigms? This should be cleared by some experts in MARL area.

youkaichao commented 4 years ago

Issue #136 is a great discussion on multi-agent rl with simultaneous move. The conclusion is that, it can be dealt with, without modifying the core of Tianshou. Just to inherit some class and re-implement one function with several lines, depending on the specific scenario.

p-veloso commented 4 years ago

I am not an expert in MARL and I have just discovered Tianshou. With that said, here are some thoughts based on the papers I have been reading recently. There are many workflows for MARL with different training (centralized/decentralized), execution (centralized/decentralized), type of agents (homogeneous/heterogeneous), task setting (cooperative/competitive/mixed), types of reward (individual/global), etc. Here are some common MARL types:

independent policies (- -)
shared individual policy (parameter sharing, see #136) (+ +)
centralized training and decentralized execution (+ -)
centralized training and execution (- +)

The problem with 1 is that it tends not to converge and might require managing different networks in training and execution (heterogeneous agents). Type 2 and Type 3 are strong trends in MARL. Type 2 can use the algorithms already implemented in Tianshou with small changes in the policy or collector. Type 3 would require implementing specific algorithms (ex: QMIX, MADDPG or COMA), like in RLlib. I think type 4 can be implemented using Tianshou with no change, but in practice it is hard to scale, as the joint action space grows exponentially in the number of agents.

Therefore, type 2 might be a good start. If there is a large interest in making Tianshou a MARL library, maybe it is worth developing 3.

youkaichao commented 4 years ago

There are many workflows for MARL with different training (centralized/decentralized), execution (centralized/decentralized), type of agents (homogeneous/heterogeneous), task setting (cooperative/competitive/mixed), types of reward (individual/global), etc.

Thank you for your comment and the overall description of MARL algorithms. It seems MARL has many variants and it is difficult to support them once at all. We have to support MARL step by step.

youkaichao commented 4 years ago

Some updates:

to support centeralized training and decenteralized execution, one can inherit the tianshou.policy.MultiAgentPolicyManager class to implement the train and evalfunction to act differently in different mode.
allow agents to see the state of other agents during training: wrap the environment to return the state of other agents in info.

jkterry1 commented 3 years ago

You guys might want to take a look at this: https://github.com/PettingZoo-Team/PettingZoo

p-veloso commented 3 years ago

I noticed that the cheat sheet for rl states that tianshou supports simultaneous move (case 1):

"for simultaneous move, the solution is simple: we can just add a num_agent dimension to state, action, and reward. Nothing else is going to change".

**Question: Is it possible to use the MultiAgentPolicyManager to work with

varied number of agents (each simulation has a different number of agents)
parameter sharing (every agent shares the same policy, so the agent data can be merged in the batch for training)
simultaneous actions?**

Last year (#136) I had to "tweak" the policy and net classes by

merging num agents and batch
adapting the operations
unmerging batch and num agents ...but it would be ideal to use the new class for that.

Trinkle23897 commented 3 years ago

varied number of agents (each simulation has a different number of agents)

Sorry about that, that is actually beyond the current scope. I haven't come up with a good design choice for this kind of requirement...

p-veloso commented 3 years ago

Isn't that just the case of adding a dimension of n_agent to the actions and observations, like I did last year and adapting the operations in the policy to that? When you reset the environment, it will set a new number of agents for that episode.

Trinkle23897 commented 3 years ago

hmm yep, you're right

p-veloso commented 3 years ago

Assuming that I can fix the number of agents for a certain period of training, does the class MultiAgentPolicyManager support parameter sharing (single policy for multiple agents) and simultaneous actions? If not, are there other classes that can support it?

Trinkle23897 commented 3 years ago

Can you pass the same reference into MAPM, i.e., MultiAgentPolicyManager([policy_1, policy_1, policy_2])? I think that would be fine to some extend.

p-veloso commented 3 years ago

That might work, but I think it will train the same network with separate batches, right?

        results = {}
        for policy in self.policies:
            data = batch[f"agent_{policy.agent_id}"]
            if not data.is_empty():
                out = policy.learn(batch=data, **kwargs)
                for k, v in out.items():
                    results["agent_" + str(policy.agent_id) + "/" + k] = v
        return results

Trinkle23897 commented 3 years ago

Exactly, that should be a tiny issue but I think it would be fine for agent to learn, though it is a little bit inefficient.

p-veloso commented 3 years ago

Thanks for the quick reply, @Trinkle23897 . As I mentioned before, last year I changed the code of the policies and networks to manage the extra agent dimension in the batch. The problem of that approach is that I had to change the code for the specific algorithm, so it might be tricky to compare multiple algorithms. Do you think that customizing the MultiAgentPolicyManager would be a more general solution in my case? Or would I still have to deal with specific changes in the policy and network classes?

Trinkle23897 commented 3 years ago

I had to change the code for the specific algorithm

I don't quite understand. Do you mean you use different algorithms in the same MAPM? Current implementation only accepts a list of on-policy algorithms or a list of off-policy algorithms. It would be no code changes if you use something like [dqn_agent, c51_agent, c51_agent].

p-veloso commented 3 years ago

No. I am using a single policy.

What I mean is that last year I changed the policy of PPO and the neural network to deal with parameter sharing and simultaneous actions. I changed the code of tianshou to manage the additional dimension in my setting.

The environment sends observations, rewards, etc. with the number of agents as the first dimension (n_agents, ?) This results in a batch with (batch dim, n agents, ?) The policy and networks use (batch x n_agents, ?) for updates and learning The resulting actions are then reshaped to (batch, n_agents, ?) to be returned to the environments

These changes were made in 1) PPO: compute_episodic_return: change how m is calculated learn: change how ratio and u are calculated

2) NETWORK forward: merge batch and agent dims, do the forward pass, unmerge batch and agents dimensions

So, my point is that it would be tricky to do that for multiple algorithms. So, I am curious if I could address these changes only by customizing my own multi agent policy manager (similar to the MultiAgentPolicyManager, but with one policy and the changes mentioned above done directly in its methods).

Trinkle23897 commented 3 years ago

Yeah, you can definitely do that. In MAPM the only thing need to do is to re-organize the Batch to buffer-style data (reshape or flatten to let the 1st dim be n_agent*bsz, also pay attention to done flag) and that would be the same as PPO single-agent.

jkterry1 commented 3 years ago

Could I offer a thought? You might want to consider using the PettingZoo parallel API (it has two). The parallel API isn't significantly different what I understand you're proposing your 1.0 one to be, and this way you're using a standard instead of a custom one. A bunch of third party libraries for PettingZoo already exist (e.g. p-veloso above has one), and RLlib and stable baselines interface with it, as do several more minor RL libraries.

Trinkle23897 commented 3 years ago

Could I offer a thought? You might want to consider using the PettingZoo parallel API (it has two). The parallel API isn't significantly different what I understand you're proposing your 1.0 one to be, and this way you're using a standard instead of a custom one. A bunch of third party libraries for PettingZoo already exist (e.g. p-veloso above has one), and RLlib and stable baselines interface with it, as do several more minor RL libraries.

That's pretty cool! I'm wondering if you have any interest in integrating this standard library into tianshou.

jkterry1 commented 3 years ago

Sure we can do that! It'll probably take a few weeks though

p-veloso commented 3 years ago

Can you pass the same reference into MAPM, i.e., MultiAgentPolicyManager([policy_1, policy_1, policy_2])? I think that would be fine to some extend.

@Trinkle23897, For now, I gave up on my previous approach (manually changing the shape of the batch), because it would require tweaking some of the algorithms that rely on the order of the observations (e.g., GAE, multi-step value prediction, etc.). I am trying the approach that you cited above, but there is a problem:

The environment should produce all the agents' observations at once. Therefore, instead of a dict {"agent_id": ... , "obs": ... "mask": ...} I have a list with n of these dictionaries which results in a Batch with wrong format.
I created a preprocess_fn to correct that (see below) , but this ends up resulting in errors in other parts of the code that rely on the assumption that the length of the data is equal the number of environments or workers ... but in reality each environment produces n_agents observations per step (i.e., one for each agent).

While I can start customizing the classes (e.g., BaseVectorEnv) to make it work, I feel this is a structural part of the library and it might take a lot of time to solve. Any suggestion for a simpler approach that preserve simultaneous actions?

def preprocess_fn(**kwargs):
"""convert from lists to dict"""
# if only obs exist -> reset
# if obs_next/act/rew/done/policy exist -> normal step
agent_idx_array = []
obs_array = []
mask_array = []
if 'rew' not in kwargs:
    tag = "obs"
else:
    tag = "obs_next"
for env_idx in range(len(kwargs[tag])):
    for obs_dict_idx in range(len(kwargs[tag][env_idx])):
        agent_idx_array.append(kwargs[tag][env_idx][obs_dict_idx]["agent_id"])
        obs_array.append(kwargs[tag][env_idx][obs_dict_idx][tag])
        mask_array.append(kwargs[tag][env_idx][obs_dict_idx]["mask"])
obs_batch = Batch(agent_id = np.array(agent_idx_array), obs = np.array(obs_array), mask = np.array(mask_array))
new_batch = Batch()
if 'rew' not in kwargs:
    new_batch.obs = obs_batch
else:
    new_batch.obs_next = obs_batch
return new_batch

Trinkle23897 commented 3 years ago

I go through the above discussion again. So if the environment produces all agent's step simultaneously and uses only one policy, there's no need to use/follow MAPM. Instead, treat this environment as a normal single-agent environment, e.g.,

the new obs_space would be [num_agents] + original obs_space.shape,
the new action_space would be [num_agents] + original action_space.shape,
reward should be a numpy array
done is still a bool
info should be the same

and then use normal way of collector/buffer (a transition in buffer stores several agent's obs/act/rew/... in this timestep), one thing you need to do is to customize your network to squash the second dimension (num_agent) into the first dimension (batch size). Since this reward is an array instead of scalar, you should pass reward_metric into trainer (and I'm not sure if any of the existing code should be tuned, like GAE, but nstep supports this feature, see #266 and https://github.com/thu-ml/tianshou/pull/266/commits/3695f126b1b8a340b1c72801fc13c69e26b33522).

Please let me know if there's anything unclear (or maybe I misunderstood some parts lol)

p-veloso commented 3 years ago

one thing you need to do is to customize your network to squash the second dimension (num_agent) into the first dimension (batch size)

Yes. In a previous version of tianshou I tried to fix that with the following changes:

NETWORK forward: merge batch and agent dims, do the forward pass, unmerge batch and agents dimensions

    def forward(self, s, state=None, info={}):
        shape = s.shape
        s = to_torch(s, device=self.device, dtype=torch.float)
        edit = len(s.shape) == 5
        if edit:
            s = s.view(shape[0] * shape[1], shape[2], shape[3], shape[4]).cuda()
        logits = self.model(s)
        if edit:
            logits = logits.view(shape[0], shape[1], -1)
        return logits, state

But that is not enough, because the single agent algorithms also assume that shape, so I have to change the batch or change them directly, such as

PPO:

compute_episodic_return: change how m is calculated

    ## ORIGINAL
    m = (1. - batch.done) * gamma #

    ## ADAPTATION I was using a single done ...so in these 2 lines, I made sure done has the same shape as rew
    d = np.repeat(batch.done, batch.rew.shape[1]).reshape(batch.rew.shape) 
    m = (1. - d) * gamma

learn: change how ratio and u are calculated

            ## ORIGINAL
            ratio = (dist.log_prob(b.act) - b.logp_old).exp().float()

            # ADAPTATION merge batch and agents
            s = b.adv.shape
            merged_shape = s[0] * s[1]
            b.adv, b.logp_old, b.returns, b.v = \
                b.adv.reshape(merged_shape), \
                b.logp_old.reshape(merged_shape), \
                b.returns.reshape(merged_shape), \
                b.v.reshape(merged_shape)
            value = value.reshape(s[0] * s[1])
            d_log_prob = dist.log_prob(b.act)
            d_log_prob = d_log_prob.reshape(s[0] * s[1])
            ratio = (d_log_prob - b.logp_old).exp().float()

            ...

            #AT THE END unmerge batch and agents
            b.adv, b.logp_old, b.returns, b.v = \
                b.adv.reshape(s[0], s[1], s[2]), \
                b.logp_old.reshape(s[0], s[1]),\
                b.returns.reshape(s[0], s[1], s[2]), \
                b.v.reshape(s[0], s[1], s[2])

As I mentioned in another post, the problems with this approach are:

I would have to change every algorithm
When I merge the first dimensions of the batch, I might have problems with parts of the algorithm that assume that the merged dimensions is a sequence collected over time and not n sequences that are "flattened".

That is why I was looking for a more general approach outside of the policies. I tried your idea of repeating the same policy in the policy manager for each agent, which is nice because each policy will only deal with a batch of one of the agents separately. However, that

required a custom preprocess_fn to convert the multi-agent observation to the right format (per agent_id).
resulted in errors because the increment in the size of the resulting batches breaks many of the assertions, such as

in BaseVectorEnv: assert len(self.data) == len(ready_env_ids) in Collector: assert len(action) == len(id)

Trinkle23897 commented 3 years ago

compute_episodic_return: change how m is calculated

learn: change how ratio and u are calculated

I would have to change every algorithm

Yep, that's what I was doing previously. ~But I don't think there's a free lunch that can use single-agent codebase to support multi-agent with little modifying and no performance decrease (for wall-time).~ But ... wait

in BaseVectorEnv: assert len(self.data) == len(ready_env_ids) in Collector: assert len(action) == len(id)

Yeah I thought about this approach last night. Let's say if you have 4 envs and each env needs 5 agent, so that you need a VectorReplayBuffer(buffer_num=20) (num_agents x num_env). However, currently the high-level module such as collector doesn't know the exact format of this given batch (e.g., how to split a batch with batch_size==4 into 20 buffer slots).

So this comes to one natural way: construct another vector env that inherit existing BaseVectorEnv, where:

__len__ should return num_agent x num_env
step and reset should unpack the corresponding result ([num_env, num_agent, ...] into [env1_agent1, env1_agent2, ..., env1_agentn, ..., envn_agentn]) (and also pack actions, the input action length is num_env x num_agent, it should resize to [num_env, num_agent, ...] and send to each environment with [num_agent, ...])
info["env_id"] should change accordingly (from num_env to num_agent x num_env)

therefore you can execute only num_env envs but get num_env x num_agent results at each step without modifying the agent's code and without using MAPM.

p-veloso commented 3 years ago

@Trinkle23897

I spent this last day trying the different approaches. I have just solved the "policy approach" for DQN in the latest version of the Tianshou, but I would definitely prefer a higher-level modification that can work with all the original policies. I think your suggestion is similar to what supersuit does ... but I had no idea how to do that in Tianshou. Thanks for the suggestion.

Just for clarification, according to your current idea, would I still need to change other parts, such as the forward pass of the neural network?

Trinkle23897 commented 3 years ago

Just for clarification, according to your current idea, would I still need to change other parts, such as the forward pass of the neural network?

None of them I think.

p-veloso commented 3 years ago

It works! Thanks again.

benblack769 commented 3 years ago

@p-veloso I just saw this, but while the supersuit example here: https://github.com/PettingZoo-Team/SuperSuit#parallel-environment-vectorization is for stable baselines, all it does is translate the parallel environment into a vector environment. Since tianshou support vector environments out of box for all algorithms, you should just be able to use supersuit's environment vectorization, rather than your own custom code. If there is some reason it doesn't work out of box, feel free to raise an issue with supersuit asking for support for tianshou.

thu-ml / tianshou

How to support multi-agent reinforcement learning #121