thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

Working with agent dimension in multi-agent workflows based on single policy (parameter sharing) #136

p-veloso closed this issue 4 years ago

p-veloso commented 4 years ago

I am looking for a simple library to implement parameter sharing in multi-agent RL using single-agent RL algorithms. I have just discovered Tianshou and it looks awesome, but I have a problem with the dimension of the data that represents the number of agents.

My project uses a custom grid-based environment where:

As far as I understand, Tianshou uses the collector both for getting the batches for the simulation (in one or multiple environments) and for retrieving batches for the training. Therefore, the agent dimension ends up both in the data used to step the environments and in the samples stored in the buffer for training.

Notice that the number of samples in a batch from the perspective of the neural network (batch_size * n_agents) is different from the number of samples from the perspective of the environment (batch_size), which can be problematic. In the simulation, the agents should generate a coherent trajectory, so the n_agents dimension is important for indicating which action vectors should be passed to which environment. I can use the forward method of the neural network model to check whether the n_agents dimension exists. In that case, I merge the batch_size and n_agents dimensions to feed the neural network and then reshape the resulting q-values to extract the action vector for each environment.

    def forward(self, s, state=None, info={}):
        s = to_torch(s, device=self.device, dtype=torch.float)  # already moved to self.device; no extra .cuda() needed
        shape = s.shape
        if len(shape) == 5:  # (batch_size, n_agents, n_channels, h, w)
            s = s.view(shape[0] * shape[1], shape[2], shape[3], shape[4])  # (batch_size * n_agents, n_channels, h, w)
            q_values = self.model(s)  # (batch_size * n_agents, n_actions)
            q_values = q_values.view(shape[0], shape[1], -1)  # (batch_size, n_agents, n_actions)
        else:  # no agent dimension: (batch_size, n_channels, h, w)
            q_values = self.model(s)  # (batch_size, n_actions)
        return q_values, state

However, this creates a problem on the training side, because the observations are stored with the n_agents dimension in the buffer, while the training algorithm assumes that dimension does not exist. For example, at line 91 of dqn.py (tianshou 0.2.3):

returns = buffer.rew[now] + self._gamma * returns, where buffer.rew[now] has shape (batch_size, n_agents) and returns has shape (batch_size,), so this line would break.
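
A quick way to see the mismatch (the sizes below are arbitrary placeholders):

    import numpy as np

    rew = np.zeros((64, 4))   # buffer.rew[now]: (batch_size, n_agents)
    returns = np.zeros(64)    # returns: (batch_size,)
    rew + 0.99 * returns      # raises ValueError: operands could not be broadcast together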

What is the best way of addressing this? I foresee two possible strategies:

1. Change the collector to merge the batch and n_agents dimensions in the buffer, and only keep the n_agents dimension when passing the action vectors to the environments during simulation (a rough illustration follows below). The problem is that this would mess with the sequential nature of the samples in the buffer, preventing the use of n-step algorithms and invalidating the terminal information (although my environment does not have an end, so this might not be critical).
2. Enable the policy to accept the additional dimension. This would preserve the sequential nature of the buffer, but it would require re-writing some methods.
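
A rough illustration of strategy 1, assuming the tianshou 0.2.x ReplayBuffer interface where add takes the transition fields directly (all shapes and sizes below are arbitrary placeholders):

    import numpy as np
    from tianshou.data import ReplayBuffer

    n_agents = 4
    buf = ReplayBuffer(1000)
    # one environment step returns data carrying an agent axis
    obs = np.zeros((n_agents, 3, 10, 10))        # (n_agents, n_channels, h, w)
    obs_next = np.ones((n_agents, 3, 10, 10))
    act = np.zeros(n_agents, dtype=int)
    rew = np.zeros(n_agents)
    # strategy 1: store each agent's transition as its own entry, so one env
    # step becomes n_agents consecutive samples and step-level ordering is lost
    for i in range(n_agents):
        buf.add(obs=obs[i], act=act[i], rew=rew[i], done=False, obs_next=obs_next[i])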

youkaichao commented 4 years ago

Thank you for your interest in Tianshou.

We are just planning to support MARL, so this is a good issue for us to understand the requirements of MARL developers. Please take a look at #121.

Please check if my understanding is correct:

The game you are playing falls into the category of simultaneous move (like MOBA games), and the simulation side works just fine with the current Tianshou. But during training, the code breaks at the line returns = buffer.rew[now] + self._gamma * returns because rew is assumed to be one-dimensional.

If that is the case, I suggest creating a new DQN class by inheriting from the DQN in Tianshou and overriding the process_fn method to split the batch and buffer, as shown here:

import numpy as np
from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import DQNPolicy

class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        # split() is pseudocode: it should yield one (batch, buffer) pair per agent
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        return Batch.stack(outputs, axis=1)

One possible problem would be that splitting the buffer in each process_fn call can be costly (splitting the batch is unavoidable, since each agent has a different return to compute).
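
For concreteness, a minimal sketch of what such a split helper might look like, assuming obs, act, rew and obs_next all carry the agent axis at dimension 1 (this helper is hypothetical, not part of Tianshou, and the per-agent buffer handling is only indicated):

    from tianshou.data import Batch, ReplayBuffer

    def split(batch: Batch, buffer: ReplayBuffer):
        """Hypothetical helper: yield one (batch, buffer) pair per agent."""
        n_agents = batch.rew.shape[1]          # rew assumed to be (batch_size, n_agents)
        for i in range(n_agents):
            agent_batch = Batch(
                obs=batch.obs[:, i],           # (batch_size, n_channels, h, w)
                act=batch.act[:, i],
                rew=batch.rew[:, i],
                done=batch.done,               # termination is shared by all agents
                obs_next=batch.obs_next[:, i],
                info=batch.info,
            )
            # a real implementation would also build a per-agent view of the
            # buffer; the shared buffer is passed through here for illustration
            yield agent_batch, buffer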

From the Tianshou side, maybe we can add an option to split the buffer once after collecting trajectories, and pass the split buffers to process_fn (refactoring Collector.sample). But this would increase the complexity of Collector. Maybe inheriting from Collector to create a new MobaCollector is a better solution.

In addition, maybe you will encounter another problem: the Tianshou 0.2.3 collector does not support vector rewards. This has been fixed in #125, provided that you have a reward_metric function to convert the vector reward into a scalar metric so as to monitor the training.
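
One simple choice for such a reward_metric, assuming it receives the per-agent reward vector, is a plain average (purely illustrative; any scalar summary would do):

    import numpy as np

    def reward_metric(rew: np.ndarray) -> float:
        # collapse the per-agent reward vector into a single scalar for logging
        return float(np.mean(rew))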

p-veloso commented 4 years ago

Yes. Overall your description of the problem is accurate.

If I understand your approach, the loop for _batch, _buffer in split(batch, buffer) will isolate the experiences of each agent in the batch before processing it.

I think this is a good approach, but I am facing some technical difficulties:

- I assume that the split function in your suggestion is only pseudocode. I could not find anything in the documentation.
- I do not understand if the _buffer is a copy of the original buffer or a version of the buffer with only the information for agent_i.
- On my computer, the Batch instance does not have a shape and does not accept slicing.

youkaichao commented 4 years ago

> I assume that the split function in your suggestion is only pseudocode. I could not find anything in the documentation.

Yes, the split function is only pseudocode. Implementation will be available soon.

> I do not understand if the _buffer is a copy of the original buffer or a version of the buffer with only the information for agent_i.

My opinion would be the buffer with only the information for agent_i. As for prioritized experience replay, this would be complicated. Is the prioritized experience replay different for each agent, or does each agent have the same sampling weight? In the former case, the prioritized sampling weights can be different for each agent, and it would be better to have totally separate buffers for each agent after the experiences are collected. For example:

from copy import copy

collectors = [copy(collector) for i in range(N)]
buffers = split(collector.buffer)   # split() is pseudocode: one buffer per agent
for collector, buffer in zip(collectors, buffers):
    collector.buffer = buffer
# do whatever you like with each collector, which holds only the transitions of agent_i

For the latter case, it is very annoying, because the importance weights have shape [batch_size] while everything else has shape [batch_size, n_agent, ...]. This would break the split. In this case, you have to hack the code yourself.

class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        # split everything else, but copy the importance weight to each agent
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        data = Batch.stack(outputs, axis=1)
        data['imp_weight'] = ...  # aggregate the importance weight over the n_agent dimension
        return data

> On my computer, the Batch instance does not have a shape and does not accept slicing.

Since you are doing something unusual, you have to be very familiar with the structure of your Batch objects: know their keys, split the right keys, avoid indexing an empty Batch, and copy the right keys. In general this is very application-dependent, and it seems Tianshou can help very little.

youkaichao commented 4 years ago

> One possible problem would be that splitting the buffer in each process_fn call can be costly

In addition, I was wrong on this point: if arr is a np.ndarray, arr[:, i] shares memory with arr (basic slicing returns a view rather than a copy), so the split itself does not copy data.
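
A quick check of that claim (shapes are arbitrary):

    import numpy as np

    arr = np.zeros((64, 4, 84, 84), dtype=np.float32)  # (batch_size, n_agents, h, w)
    view = arr[:, 1]                                    # basic slicing returns a view, not a copy
    print(np.shares_memory(arr, view))                  # True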

p-veloso commented 4 years ago

Thanks for the quick reply, youkaichao.

Regarding prioritized experience replay, I believe that both cases are similar ... especially if we are traversing all the agents, so whatever is simpler to implement will be adopted. Besides, prioritized experience replay is not essential for my prototype.

Regarding the Batch: I am trying to use the numpy slicing syntax for the split function. However, even in the simple example from the documentation, it breaks and does not have a shape. I will investigate it.

In any case, I will experiment with this approach using DQN and other algorithms before posting more questions here. Maybe for on-policy algorithms this will be even simpler.

youkaichao commented 4 years ago

> Regarding prioritized experience replay, I believe that both cases are similar

The implementation will be different.

> whatever is simpler to implement will be adopted

Agree. It's up to you and heavily depends on your code.

> Regarding the Batch: I am trying to use the numpy slicing syntax for the split function. However, even in the simple example from the documentation, it breaks and does not have a shape. I will investigate it.

Keep us informed. If it is still not resolved, you can post the content of your Batch object; maybe it is a bug in Batch, or we can point out the problem for you.

> In any case, I will experiment with this approach using DQN and other algorithms before posting more questions here. Maybe for on-policy algorithms this will be even simpler.

That's great! Keep us informed, please. We need feedback from multi-agent RL developers to understand the requirements in MARL.

p-veloso commented 4 years ago
> collectors = [copy(collector) for i in range(N)]
> buffers = split(collector.buffer)
> for collector, buffer in zip(collectors, buffers):
>     collector.buffer = buffer
> # do whatever you like with each collector, which holds only the transitions of agent_i

1. The issue with the Batch was solved when I started using version 0.2.4.

2. As far as I understand, a collector uses the method collect to get the samples from one or more environments. For the sake of simplicity, let's assume there is only one environment. Based on your code, each collector should be restricted to the experience of one agent. However, when collectors[i].collect gets the samples from the environment for all the agents, it should keep the information for agent i and distribute the information for agents 0, 1, ..., i - 1, i + 1, ..., n - 1 to the buffers of the other collectors. Is this interpretation correct?

3. I feel that keeping a single collector and using multiple buffers would make more sense, given that the method collect of a single collector necessarily accesses the experience of all the agents, i.e. one step in the environment. The changes in this scenario are:

4. An alternative to 3 would be to replace each single buffer with a Manager class holding n buffers. In this case, whenever the collector sends a batch to the buffer, it splits the data between the respective agents. Also, when the collector samples from this manager, it samples from the different buffers. The good thing about this approach is that it seems to minimize changes to the code, but I do not know if other classes interact with the buffer.

Are these approaches feasible? Do they require changes outside of the classes Collector (3) or ReplayBuffer (4)?
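
For what it's worth, a minimal sketch of what approach 4 could look like, assuming the 0.2.x-era ReplayBuffer interface and per-agent data carried along an agent axis (the BufferManager name and interface are hypothetical, not part of Tianshou):

    from tianshou.data import Batch, ReplayBuffer

    class BufferManager:
        """Hypothetical wrapper: one ReplayBuffer per agent behind a single interface."""

        def __init__(self, n_agents: int, size: int):
            self.buffers = [ReplayBuffer(size) for _ in range(n_agents)]

        def add(self, obs, act, rew, done, obs_next):
            # split one environment step along the agent axis (dimension 0 here)
            for i, buf in enumerate(self.buffers):
                buf.add(obs=obs[i], act=act[i], rew=rew[i], done=done,
                        obs_next=obs_next[i])

        def sample(self, batch_size: int):
            # sample each per-agent buffer (sample returns a (batch, indice) pair)
            batches = [buf.sample(batch_size)[0] for buf in self.buffers]
            # stack the per-agent batches back along an agent axis
            return Batch.stack(batches, axis=1)

Note that sampling each buffer independently would not keep the per-agent samples aligned to the same environment steps; sharing one set of sampled indices across all the buffers would be needed for that.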

youkaichao commented 4 years ago

> Based on your code, each collector should be restricted to the experience of one agent. However, when collectors[i].collect gets the samples from the environment for all the agents, it should keep the information for agent i and distribute the information for agents 0, 1, ..., i - 1, i + 1, ..., n - 1 to the buffers of the other collectors.

My proposal is that you collect experience with one collector first, and after collecting, you split that collector into multiple collectors.

p-veloso commented 4 years ago

From the perspective of my own prototype, I think the approaches that I proposed above are too complicated. I will consider your approach in the near future and see if I can figure out an efficient way of splitting the Buffer.

Today, I opted for an easy route for DQN that seems to be working:

  1. I always include the agent dimension in the state, reward, and action samples.
  2. I am using a custom MobaNet that can process data with the additional dimension. It merges the batch and agent dimensions before the forward pass and un-merges these dimensions in the resulting q-values before returning.
  3. I created a class MobaDQN with custom forward, compute_nstep_return, _target_q and learn methods to make sure that the tensor and array operations consider the additional "n agents" dimension.
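
The essential change in those overrides is to reduce over the action axis only, so that the agent axis survives; for example (shapes arbitrary):

    import torch

    q = torch.randn(64, 4, 6)        # q-values: (batch_size, n_agents, n_actions)
    target = q.max(dim=-1).values    # per-agent target values: (batch_size, n_agents)
    act = q.argmax(dim=-1)           # per-agent greedy actions: (batch_size, n_agents)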

I will test it and explore other extensions (e.g. prioritized replay) and algorithms (PPO, A2C, etc.). I am not sure if we should close this question now or leave it open for future updates.

Thanks for helping me figure out how to customize Tianshou.

Trinkle23897 commented 4 years ago

Seems resolved. Further discussion can move to #121.

Trinkle23897 commented 4 years ago

Hi @p-veloso, today we merged the marl-example into the dev branch (#122). Our proposed method and documentation are at https://tianshou.readthedocs.io/en/master/tutorials/tictactoe.html. You can have a look, and we welcome your feedback!