uinversion opened this issue 1 year ago (Open)
@uinversion Regarding 2), I noticed the same thing. You could set `ignore_obs_next` to True when initializing your ReplayBuffer, but then you might not be able to perform n-step estimation, and it could cause problems with the updates anyway. I might be wrong, but I believe Tianshou chains the observations taken by agents sharing the same buffer independently of their ID when `buffer.next()` is called by the collector, because the library was designed for single agents only. If that's the case, a better solution would probably be to create a `MultiAgentSharedBuffer` where you fetch the next observation by matching the agent ID.
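To make the idea concrete, here is a minimal pure-Python sketch of "fetch the next observation by matching the ID". The function name, the buffer-of-dicts layout, and the field names are all illustrative assumptions, not Tianshou internals:

```python
# Hypothetical sketch of the MultiAgentSharedBuffer idea: instead of taking
# the chronologically next buffer entry as obs_next, scan forward for the
# next entry written by the *same* agent.

def next_obs_for_agent(buffer, idx):
    """Return the next observation belonging to the same agent as buffer[idx]."""
    agent_id = buffer[idx]["agent_id"]
    for entry in buffer[idx + 1:]:
        if entry["agent_id"] == agent_id:
            return entry["obs"]
    return None  # episode (or buffer) ended before this agent acted again

# Two agents alternating turns in one shared buffer:
shared = [
    {"agent_id": "agent_1", "obs": "o1_t0"},
    {"agent_id": "agent_2", "obs": "o2_t0"},
    {"agent_id": "agent_1", "obs": "o1_t1"},
    {"agent_id": "agent_2", "obs": "o2_t1"},
]
```

With this layout, the naive "next index" rule would hand agent_1 the entry at index 1 (agent_2's observation), while matching on `agent_id` correctly skips to index 2.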
@RaffaeleGalliera thank you for the reply, and apologies for the delay in getting back. I tried `ignore_obs_next`, and I think it does fix the observation leak. But I was still unable to get the agents learning in the multi-agent case (with one frozen agent), although `ignore_obs_next=True` does seem to work in the strictly single-agent case.
I've actually stumbled onto an even more peculiar problem. I always assumed that observations are independent of their order, i.e. whether I give the agent the observation [o1, o2, o3] or [o2, o1, o3], it should be able to learn either way (provided, of course, that I am consistent with the ordering). But somehow, simply changing the order of observations makes the agent unable to learn.
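For reference, the assumption being made here is just that a fixed permutation of the observation vector is an invertible relabelling of input features, so in principle a network can learn from either ordering. A tiny illustrative sketch (all names hypothetical):

```python
# A fixed permutation of the observation vector is invertible, so no
# information is lost by reordering -- as long as the same permutation
# is applied on every step.
perm = [1, 0, 2]  # illustrative: swap the first two observation components

def reorder(obs, perm):
    return [obs[i] for i in perm]

obs = ["o1", "o2", "o3"]
shuffled = reorder(obs, perm)            # ["o2", "o1", "o3"]
inverse = [perm.index(i) for i in range(len(perm))]
restored = reorder(shuffled, inverse)    # back to ["o1", "o2", "o3"]
```

If learning breaks under such a consistent reordering, the likely culprit is something stateful in the pipeline (e.g. the buffer or preprocessing) rather than the network itself.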
I am unsure why this is happening, and I am currently investigating the learning behaviour of the MARL policy and PPO.
Update: I've fixed the bug by modifying the preprocessing step in mapolicy.py. The code is present in my fork: https://github.com/Cyclic-Function/tianshou/blob/master/tianshou/policy/multiagent/mapolicy.py
I have only tested the code for 2 agents. I will submit a pull request after testing it for > 2 agent systems.
@uinversion very sorry for the very late reply
In a MARL setting, how do I properly freeze the policy of one agent? By freezing, I mean that the policy does not learn and the weights remain fixed, but the policy can still be evaluated. Currently I am doing something like below. Is this correct?
It's correct, but I think the best way is to modify MAPM.learn to accept an extra `mask` argument:
```python
# Requires: from typing import Any, Dict, List, Optional, Union
# and: from tianshou.data import Batch
def learn(self, batch: Batch, mask: Optional[Dict[str, bool]] = None,
          **kwargs: Any) -> Dict[str, Union[float, List[float]]]:
    results = {}
    # Default: update every agent's policy. A dict keyed by agent_id is used
    # here because self.policies is keyed by agent_id as well.
    if mask is None:
        mask = {agent_id: True for agent_id in self.policies}
    for agent_id, policy in self.policies.items():
        data = batch[agent_id]
        if not data.is_empty() and mask[agent_id]:
            out = policy.learn(batch=data, **kwargs)
            for k, v in out.items():
                results[agent_id + "/" + k] = v
    return results
```
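The masking logic above can be exercised without Tianshou by standing in stub policies. Everything below (`StubPolicy`, the dict layout) is an illustrative mimic, not library code:

```python
# Mimic of the masked-learn loop: agent_1 is frozen (mask False), so only
# agent_2's policy gets updated; results are namespaced by agent id.

class StubPolicy:
    def __init__(self):
        self.updates = 0

    def learn(self, batch):
        self.updates += 1
        return {"loss": 0.0}

policies = {"agent_1": StubPolicy(), "agent_2": StubPolicy()}
mask = {"agent_1": False, "agent_2": True}  # freeze agent_1

results = {}
for agent_id, policy in policies.items():
    if mask[agent_id]:
        out = policy.learn(batch=None)
        for k, v in out.items():
            results[agent_id + "/" + k] = v
```

The frozen policy is never called, so its weights stay fixed, but it remains available for action selection during rollouts.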
@Trinkle23897 Would this approach hold for environments where agents vary at each cycle? Wouldn't it be better and more flexible to create a custom Collector? From my understanding, the default Collector works by stepping agent i while observing agent i+1: an action gets computed for agent i from the policy and its buffer record is updated, while the observation for agent i+1 is added as a subsequent entry to the buffer (indexed by env id if we have multiple envs). I believe the observation leak is caused by the fact that when querying the next buffer index (say, to get the next observation in chronological order for agent i), this will probably return the observation of some other agent (say, agent i+1 from env x) instead of agent i from env x.
Maybe it would be more appropriate to have a `VectorReplayBuffer` with `buffer_num=len(envs)*len(agents)`, which divides the total buffer size by that number, reserving a sub-buffer for each agent of each environment, and then use a custom collector which indexes the entries in a matrix fashion by `agent_id` and `env_id`. What do you think? I am working on my own Collector following this path, and if I am not wrong, a variable number of agents could also be easier to handle.
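The "matrix fashion" indexing can be sketched as a flat mapping from (env, agent) pairs to sub-buffer slots. The row-major convention below is one possible choice, not Tianshou's actual layout:

```python
# With buffer_num = len(envs) * len(agents), each (env_id, agent_idx) pair
# gets its own reserved sub-buffer; a row-major mapping keeps the slots
# disjoint, so no agent's obs_next can ever point into another agent's data.

n_envs, n_agents = 4, 2

def sub_buffer_index(env_id, agent_idx, n_agents):
    """Row-major: env 0 owns sub-buffers 0..n_agents-1, env 1 the next block, etc."""
    return env_id * n_agents + agent_idx

total_sub_buffers = n_envs * n_agents  # the buffer_num to pass to the buffer
```

Because every pair maps to a distinct slot, chronological "next index" lookups stay within a single agent's own transitions.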
> Wouldn't it be better and more flexible to create a custom Collector?

Of course!

> What do you think?

I think it's the right thing to do.
Hello, I am working with multi-agent policies, and I am facing some issues with rewards and training. Could you please help me resolve this?
Could it be possible that there is a leak of agent_2's observations to agent_1? For the A2C policy, I found that agent_1 received both `obs` and `obs_next`. `obs` is its own observation, but `obs_next` is agent_2's observation, which could be a possible source of the leak. However, I am still uncertain what `obs_next` really does. Is it a single-agent RL attribute? I would be really grateful if you could help me with this.
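The suspected leak can be shown with a tiny illustration (the data and naive lookup rule below are purely illustrative): in a single shared buffer where two agents alternate turns, deriving `obs_next` as "the observation stored at the next index" hands agent_1 the other agent's observation.

```python
# Interleaved shared buffer: agents alternate, so chronologically adjacent
# entries belong to different agents.
shared_obs = ["agent1_obs_t0", "agent2_obs_t0", "agent1_obs_t1", "agent2_obs_t1"]

def naive_obs_next(idx):
    """The single-agent assumption: obs_next is simply the next stored obs."""
    return shared_obs[idx + 1]

# For agent_1's transition at index 0, the correct obs_next would be
# shared_obs[2] ("agent1_obs_t1"), but the naive rule returns index 1,
# which is agent_2's observation -- i.e. the leak described above.
```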
Also, just as a sidenote: I am really liking how brilliantly customisable Tianshou is. I recommended it to my colleagues, and they all had positive reviews.