Thank you for your interest in Tianshou.
We are just planning to support MARL, so this is a good issue for us to understand the requirements of MARL developers. Please take a look at #121.
Please check whether my understanding is correct: the game you are playing falls into the category of simultaneous moves (like MOBA games), and the simulation side works just fine with the current Tianshou. During training, however, the code breaks at the line returns = buffer.rew[now] + self._gamma * returns, because rew is assumed to be one-dimensional.
If that is the case, I suggest creating a new DQN class that inherits from Tianshou's DQNPolicy and overriding the process_fn method to split the batch and buffer there:
import numpy as np
from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import DQNPolicy

class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        # split() is pseudocode: yield a per-agent view of the batch/buffer
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        return Batch.stack(outputs, axis=1)
One possible problem is that splitting the buffer in each process_fn call can be costly (splitting the batch is unavoidable, since each agent has a different return to compute). On Tianshou's side, we could add an option to split the buffer once after collecting trajectories and pass the split buffers to process_fn (refactoring Collector.sample). But this would increase the complexity of Collector; inheriting from Collector to create a new MobaCollector may be a better solution.
In addition, you may encounter another problem: the Tianshou 0.2.3 collector does not support vector rewards. This has been fixed in #125, provided that you have a reward_metric function to convert the vector reward into a scalar metric so as to monitor the training.
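For example, a minimal reward_metric could simply sum the per-agent rewards (the exact signature expected by your Tianshou version may differ, so treat this as a sketch):

import numpy as np

def reward_metric(rew: np.ndarray) -> np.ndarray:
    # rew carries the per-agent rewards in its last dimension; collapse that
    # dimension so the collector can log a single scalar per transition/episode.
    return rew.sum(axis=-1)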
Yes. Overall your description of the problem is accurate.
If I understand your approach, the loop for _batch, _buffer in split(batch, buffer) will isolate the experiences of each agent in the batch before processing it.
I think this is a good approach, but I am facing some technical difficulties:
I assume that the split function in your suggestion is only pseudocode; I could not find anything about it in the documentation. I tried slicing the Batch with numpy-style indexing (e.g. batch_i = batch[:, agent_i]), but on my machine the Batch instance does not have a shape and does not accept slicing.
Yes, the split function is only pseudocode; an implementation will be available soon.
I do not understand if the _buffer is a copy of the original buffer or is a version of the buffer only with the information for agent_i
My intention is a buffer containing only the information for agent_i. As for prioritized experience replay, this is more complicated: is the prioritized sampling different for each agent, or does each agent share the same sampling weight? In the former case, the prioritized sampling weights can differ per agent, and it would be better to have completely separate buffers for each agent after the experiences are collected. For example:
from copy import copy

collectors = [copy(collector) for i in range(N)]  # N = number of agents
buffers = split(collector.buffer)                 # split() is again pseudocode
for collector_i, buffer_i in zip(collectors, buffers):
    collector_i.buffer = buffer_i
    # do whatever you like with collector_i, which holds only the transitions of agent_i
In the latter case it is more annoying, because the importance weights have shape [batch_size] while everything else has shape [batch_size, n_agents, ...], which would break the split. In that case you have to hack the code yourself:
class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        # split everything else, but copy the importance weight to each agent
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        data = Batch.stack(outputs, axis=1)
        data['imp_weight'] = ...  # aggregate the importance weight over the n_agent dimension
        return data
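For reference, a hypothetical split helper could look like the sketch below. The name split_batch and its assumptions are mine, not Tianshou's: it assumes every per-agent field is stored as a numpy array of shape [batch_size, n_agents, ...] and that 1-D arrays (such as importance weights) should simply be copied to every sub-Batch.

import numpy as np
from tianshou.data import Batch

def split_batch(batch: Batch, n_agents: int):
    # Hypothetical sketch, not part of Tianshou: yield one per-agent sub-Batch.
    for i in range(n_agents):
        sub = {}
        for key in batch.keys():
            val = batch[key]
            if not isinstance(val, np.ndarray):
                continue  # skip empty/nested entries such as info
            # Slice arrays that carry an agent dimension; copy 1-D arrays
            # (e.g. importance weights) unchanged.
            sub[key] = val[:, i] if val.ndim >= 2 else val
        yield Batch(**sub)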
on my machine the Batch instance does not have a shape and does not accept slicing.
Since you are doing something unusual, you have to be very familiar with the structure of your Batch objects, know their keys, split the right keys, avoid indexing empty Batch, and copy the right keys. In general this would be very application dependent and it seems Tianshou can help very little.
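For example, before writing a custom split it helps to inspect the keys and per-key shapes of a sampled batch (a toy sketch; the actual keys depend on your environment):

import numpy as np
from tianshou.data import Batch

# Toy batch mimicking the multi-agent layout: batch_size=4, n_agents=2, obs_dim=5
b = Batch(obs=np.zeros((4, 2, 5)), act=np.zeros((4, 2)), rew=np.zeros((4, 2)))
for key in b.keys():
    print(key, getattr(b, key).shape)  # obs (4, 2, 5), act (4, 2), rew (4, 2)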
One possible problem is that splitting the buffer in each process_fn call can be costly
In addition, I was wrong on this point: if arr is a np.ndarray, arr[:, i] shares memory with arr in most cases (basic slicing returns a view, not a copy).
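A quick check of this with plain numpy (nothing Tianshou-specific):

import numpy as np

arr = np.zeros((4, 3))
view = arr[:, 1]       # basic slicing returns a view, not a copy
view[:] = 7.0
print(arr[:, 1])                     # [7. 7. 7. 7.] -- the original array changed
print(np.shares_memory(arr, view))   # True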
Thanks for the quick reply, youkaichao.
Regarding prioritized experience replay, I believe that both cases are similar, especially if we are traversing all the agents, so whatever is simpler to implement will be adopted. Besides, prioritized experience replay is not essential for my prototype.
Regarding the Batch: I am trying to use the numpy slicing syntax for the split function. However, even in the simple example from the documentation, it breaks and does not have a shape. I will investigate it.
In any case, I will experiment with this approach using DQN and other algorithms before posting more questions here. Maybe for on-policy algorithms this will be even simpler.
Regarding prioritized experience replay, I believe that both cases are similar
The implementation will be different.
whatever is simpler to implement will be adopted
Agree. It's up to you and heavily depends on your code.
Regarding the Batch: I am trying to use the numpy slicing syntax for the split function. However, even in the simple example from the documentation, it breaks and does not have a shape. I will investigate it.
Keep us informed. If it is still not resolved, you can post the content of your Batch object; maybe it is a bug in Batch, or we can point out the problem for you.
In any case, I will experiment with this approach using DQN and other algorithms before posting more questions here. Maybe for on-policy algorithms this will be even simpler.
That's great! Keep us informed, please. We need feedback from multi-agent RL developers to understand the requirements of MARL.
from copy import copy

collectors = [copy(collector) for i in range(N)]  # N = number of agents
buffers = split(collector.buffer)                 # split() is again pseudocode
for collector_i, buffer_i in zip(collectors, buffers):
    collector_i.buffer = buffer_i
    # do whatever you like with collector_i, which holds only the transitions of agent_i
1. The issue with the Batch was solved when I started using version 0.2.4.
2. As far as I understand, a collector uses its collect method to get samples from one or more environments. For simplicity, let's assume there is only one environment. Based on your code, each collector should be restricted to the experience of one agent. However, when collectors[i].collect gets the samples from the environment for all agents, it should keep the information for agent i and distribute the information for agents 0, 1, ..., i-1, i+1, ..., n-1 to the buffers of the other collectors. Is this interpretation correct?
3. I feel that keeping a single collector and using multiple buffers would make more sense, given that the collect method of a single collector necessarily accesses the experience of all agents, i.e. one step in the environment. The changes in this scenario are:
    assert len(self.buffer) == self.env_num
4. An alternative to 3 would be to replace each single buffer with a Manager class holding n buffers. In this case, whenever the collector sends a batch to the buffer, the manager splits the data among the respective agents; when the collector samples from the manager, it samples from the different buffers (a rough sketch follows below). The good thing about this approach is that it seems to minimize changes to the code, but I do not know whether other classes interact with the buffer.
Are these approaches feasible? Do they require changes outside of the Collector (3) or ReplayBuffer (4) classes?
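A rough sketch of the Manager idea in 4 (the class name and method choices are assumptions, not Tianshou API), assuming each transition arrives with a leading agent dimension and that the per-agent buffers stay aligned because transitions are added in lockstep:

from tianshou.data import Batch, ReplayBuffer

class BufferManager:
    """Hypothetical facade over one ReplayBuffer per agent."""

    def __init__(self, size: int, n_agents: int):
        self.n_agents = n_agents
        self.buffers = [ReplayBuffer(size) for _ in range(n_agents)]

    def add(self, obs, act, rew, done, obs_next, **kwargs):
        # obs/act/rew/obs_next carry a leading agent dimension; strip it so
        # every per-agent buffer stores plain single-agent transitions.
        for i, buf in enumerate(self.buffers):
            buf.add(obs[i], act[i], rew[i], done, obs_next[i], **kwargs)

    def sample(self, batch_size: int):
        # Reuse the same indices for every agent, then stack along a new
        # agent axis so downstream code sees [batch_size, n_agents, ...].
        batch0, indice = self.buffers[0].sample(batch_size)
        batches = [batch0] + [buf[indice] for buf in self.buffers[1:]]
        return Batch.stack(batches, axis=1), indice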
Based on your code, each collector should be restricted to the experience of one agent. However, when collectors[i].collect gets the samples from the environment for all agents, it should keep the information for agent i and distribute the information for agents 0, 1, ..., i-1, i+1, ..., n-1 to the buffers of the other collectors.
My proposal is that you collect experience with one collector first, and only after collecting do you split that collector into multiple collectors.
From the perspective of my own prototype, I think the approaches I proposed above are too complicated. I will consider your approach in the near future and see if I can figure out an efficient way of splitting the buffer.
Today I opted for an easy route for DQN that seems to be working.
I will test it and explore other extensions (e.g. prioritized replay) and algorithms (PPO, A2C, etc.). I am not sure whether we should close this question now or leave it open for future updates.
Thanks for helping me figure out how to customize Tianshou.
Seems resolved. Further discussion can move to #121.
Hi @p-veloso, today we merged the marl-example into the dev branch (#122). Our proposed method and documentation are at https://tianshou.readthedocs.io/en/master/tutorials/tictactoe.html. Please take a look; feedback is welcome!
I am looking for a simple library to implement parameter sharing in multi-agent RL using single-agent RL algorithms. I have just discovered Tianshou and it looks awesome, but I have a problem with the dimension of the data that represents the number of agents.
My project uses a custom grid-based environment where:
As far as I understand, Tianshou uses the collector both for getting the batches for the simulation (in one or multiple environments) and for retrieving batches for the training. Therefore, the same batch layout has to serve both purposes.
Notice that the number of samples in a batch from the perspective of the neural network (batch_size * n_agents) is different from the number of samples from the perspective of the environment (batch_size), which can be problematic. In the simulation, the agents should generate a coherent trajectory, so the n_agents dimension is important to indicate which action vectors should be passed to which environment. In the forward method of the neural network model I can check whether the n_agents dimension exists; if so, I flatten batch_size and n_agents to run the network and then reshape the resulting Q-values to extract the action vector for each environment.
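A minimal sketch of this reshaping trick in the model's forward method (the network architecture and sizes are placeholders, not code from the project):

import torch
import torch.nn as nn

class SharedQNet(nn.Module):
    """Hypothetical shared-parameter Q-network for n agents."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        if obs.dim() == 3:                     # [batch, n_agents, obs_dim]
            bsz, n_agents, obs_dim = obs.shape
            q = self.net(obs.reshape(bsz * n_agents, obs_dim))
            return q.reshape(bsz, n_agents, -1), state  # one action vector per agent
        return self.net(obs), state            # flat input: [batch, obs_dim]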
However, this creates a problem on the training side, because the observations are stored with the n_agents dimension in the buffer. For the training algorithm, that dimension does not exist. For example, in line 91 of dqn.py (Tianshou 0.2.3)
returns = buffer.rew[now] + self._gamma * returns
buffer.rew[now] has shape (batch_size, n_agents) while returns has shape (batch_size,), so this line would break. What is the best way of addressing this? I foresee two possible strategies:
1) Change the collector to merge batch_size and n_agents in the buffer, and only keep the n_agents dimension when passing the action vectors to the environments during simulation. The problem is that this would mess with the sequential nature of the samples in the buffer, preventing the use of n-step algorithms and invalidating the terminal information (although my environment does not have an end, so this might not be critical).
2) Enable the policy to accept the additional dimension. This would preserve the sequential nature of the buffer, but it would require rewriting some methods.
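The mismatch can be reproduced with plain numpy (shapes chosen arbitrarily for illustration):

import numpy as np

batch_size, n_agents, gamma = 64, 3, 0.99
rew = np.zeros((batch_size, n_agents))  # buffer.rew[now] with an agent dimension
returns = np.zeros(batch_size)          # returns accumulated as a 1-D array

try:
    returns = rew + gamma * returns     # (64, 3) + (64,) cannot broadcast
except ValueError as err:
    print(err)  # operands could not be broadcast together ...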