thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License
7.75k stars 1.12k forks

Possible Leak of Observations In Multi-Agent Policies #806

Open uinversion opened 1 year ago

uinversion commented 1 year ago

Hello, I am working with multi-agent policies, and I am facing some issues with rewards and training. Could you please help me with resolving this?

  1. In a MARL setting, how do I properly freeze the policy of one agent? By freezing, I mean that the policy does not learn and its weights remain fixed, but the policy can still be evaluated. Currently I am doing something like the following. Is this correct?
# in PPO.py

def learn(  # type: ignore
    self, batch: Batch, batch_size: int, repeat: int, **kwargs: Any
) -> Dict[str, List[float]]:
    losses, clip_losses, vf_losses, ent_losses = [], [], [], []

    if self.frozen:
        # frozen agent: skip the update and return empty loss lists
        return {
            "loss": losses,
            "loss/clip": clip_losses,
            "loss/vf": vf_losses,
            "loss/ent": ent_losses,
        }
    else:
        # continue with learning
  2. I have two agents, agent_1 and agent_2, and each has two state variables, position and velocity. Each agent can observe the other. I have frozen agent_2 (agent_2 is also untrained, so effectively random), and I am only training agent_1. I am finding that changing agent_2's observations affects agent_1. This is unexpected: since agent_2 is frozen and randomly initialised, it can convey no information to agent_1.

Could it be possible that agent_2's observations are leaking to agent_1? For the A2C policy, I found that agent_1 receives both obs and obs_next: obs is its own observation, but obs_next is agent_2's observation, which could be the source of the leak. However, I am still uncertain what obs_next really does. Is it an attribute meant for single-agent RL?
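
For reference, this is how I have been thinking about obs / obs_next in the plain single-agent case (a rough sketch, assuming a recent tianshou release and the gymnasium step API; CartPole is only for illustration):

import gymnasium as gym
from tianshou.data import Batch

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
act = env.action_space.sample()
obs_next, rew, terminated, truncated, info = env.step(act)

# a single-agent transition keeps obs (what the agent saw before acting) and
# obs_next (what the same agent sees after acting) together in one record
transition = Batch(obs=obs, act=act, rew=rew, terminated=terminated,
                   truncated=truncated, obs_next=obs_next, info=info)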

I would be really grateful if you could help me with this.

Also, just as a sidenote: I really like how brilliantly customisable tianshou is. I recommended it to my colleagues, and they all gave positive reviews.

RaffaeleGalliera commented 1 year ago

@uinversion Regarding 2): I noticed the same thing. You could set ignore_obs_next to True when initializing your ReplayBuffer, but then you may not be able to perform n-step estimation, and it could cause problems with the updates anyway. I might be wrong, but I believe Tianshou chains the observations of agents sharing the same buffer regardless of their agent ID when buffer.next() is called by the collector, because the library was originally designed for single-agent training. If that's the case, a better solution would probably be to create a MultiAgentSharedBuffer where the next observation is fetched by matching the agent ID.
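
As a minimal sketch of that workaround (assuming a recent tianshou version; the buffer size and number of environments are made up for illustration):

from tianshou.data import VectorReplayBuffer

train_env_num = 4  # assumption: 4 parallel training environments
buffer = VectorReplayBuffer(
    total_size=20_000,         # illustrative capacity, shared across sub-buffers
    buffer_num=train_env_num,  # one sub-buffer per environment
    ignore_obs_next=True,      # do not store obs_next coming from the env step
)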

cycfunc commented 1 year ago

@RaffaeleGalliera thank you for the reply, and apologies for the delay in getting back. I tried ignore_obs_next, and I think it does fix the observation leak. But I was still unable to get the agents to learn in the multi-agent case (with one frozen agent), although ignore_obs_next=True does seem to work in the strictly single-agent case.

I've actually stumbled onto an even more peculiar problem. I always assumed that observations are independent of their order, i.e., whether I give the agent the observation [o1, o2, o3] or [o2, o1, o3], it should be able to learn either way (provided, of course, that I am consistent about the order). But somehow, simply changing the order of the observations makes the agent unable to learn.

I am unsure why this is happening, and I am currently investigating the learning behaviour of MARL policy and PPO.

cycfunc commented 1 year ago

Update: I've fixed the bug by modifying the preprocessing step in mapolicy.py. The code is present in my fork: https://github.com/Cyclic-Function/tianshou/blob/master/tianshou/policy/multiagent/mapolicy.py

I have only tested the code with 2 agents. I will submit a pull request after testing it on systems with more than 2 agents.

Trinkle23897 commented 1 year ago

@uinversion very sorry for the very late reply

In MARL setting, how to properly freeze the policy of one agent? By freezing, I mean that the policy does not learn and the weights remain fixed, but the policy can still be evaluated. Currently I am doing something like below. Is this correct?

It's correct, but I think the best way is to modify MultiAgentPolicyManager.learn to accept an extra mask argument:

    def learn(self, batch: Batch, mask: Optional[List[bool]] = None,
              **kwargs: Any) -> Dict[str, Union[float, List[float]]]:
        results = {}
        # by default train every agent; mask[i] = False freezes the i-th agent
        if mask is None:
            mask = [True] * len(self.policies)
        for i, (agent_id, policy) in enumerate(self.policies.items()):
            data = batch[agent_id]
            if not data.is_empty() and mask[i]:
                out = policy.learn(batch=data, **kwargs)
                for k, v in out.items():
                    results[agent_id + "/" + k] = v
        return results

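A hypothetical call with the proposed mask would then look like this (assuming two agents registered in the order [agent_1, agent_2] and a manager instance named policy):

# train agent_1 only; agent_2 stays frozen during this update
result = policy.learn(batch, mask=[True, False])
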
RaffaeleGalliera commented 1 year ago

@Trinkle23897 Would this approach hold for environments where the set of agents varies at each cycle? Wouldn't it be better and more flexible to create a custom Collector? From my understanding, the default Collector works by stepping agent i while observing agent i+1: an action is computed for agent i from the policy and its buffer record is updated, while the observation for agent i+1 is added as the subsequent entry in the buffer (indexed by env id if there are multiple envs). I believe the observation leak is caused by the fact that, when querying the next buffer index (say, to get the next observation in chronological order for agent i), what comes back is probably the observation of some other agent (say, agent i+1 from env x) instead of the next observation of agent i from env x.

Maybe it would be more appropriate to have a VectorReplayBuffer with buffer_num=len(envs)*len(agents), which divides the total buffer size by that number so that each agent of each environment gets a reserved sub-buffer, and then use a custom Collector that indexes the entries in a matrix fashion by agent_id and env_id. What do you think? I am working on my own Collector following this path, and if I am not wrong, a variable number of agents could also be easier to handle this way.
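
A rough sketch of that layout (this is the proposed idea, not an existing tianshou feature; the sizes, env/agent counts, and the helper below are made up for illustration):

from tianshou.data import VectorReplayBuffer

env_num, agent_num = 4, 2  # assumptions for the sketch
buffer = VectorReplayBuffer(
    total_size=20_000,               # divided evenly across the sub-buffers
    buffer_num=env_num * agent_num,  # one reserved sub-buffer per (env, agent) pair
)

def sub_buffer_index(env_id: int, agent_idx: int) -> int:
    # hypothetical helper a custom Collector could use to map
    # (env_id, agent_idx) to the flat sub-buffer index
    return env_id * agent_num + agent_idx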

Trinkle23897 commented 1 year ago

Wouldn't it be better and more flexible to create a custom Collector?

Of course!

What do you think?

I think it's the right thing to do.