vmoens opened 1 year ago
Hi there. We are building a multi-agent environment based on Nvidia Isaac Gym and want to integrate it with torchrl somehow. So we are more than glad to help with 2. and 3., and thus would like to know your idea on what a multi-agent interface should look like and how and where ensembling could take place accordingly.
The solution should be general enough to cover the following cases:
The first to-be-addressed questions I can think of:
- Env: what are the rules for defining the tensor_specs for multiple agents?
- Collector: how should we structure a joint policy and perform data collection?

Specifically for the example/tutorial, I think multi-agent PPO with actors of the same structure (so we can showcase model ensembling) and a single centralized critic could be a good one.
What do you think?
This sounds wonderful. We're actively working with @matteobettini on refactoring the API to make it more MARL friendly. We're also getting support from Nvidia to integrate Isaac sim. The stars are clearly aligned 😀
The ambition I have for MARL and multitask is to treat envs like we treat tensors: being able to stack them, or perhaps in some cases to index them. The specs would follow a similar logic: you can already expand specs, in the future I'd like to be able to stack them (even if heterogeneous, like we do with tensordict). I'll ping you as the work goes on, I'd love to get feedback on these features.
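To give a flavour of what I mean (a minimal sketch with made-up keys and shapes), heterogeneous tensordicts can already be stacked lazily today:

import torch
from tensordict import TensorDict

# two "agents" with partly different observation entries (made-up example)
td_a = TensorDict({"obs": torch.randn(4), "lidar": torch.randn(8)}, batch_size=[])
td_b = TensorDict({"obs": torch.randn(4)}, batch_size=[])
stacked = torch.stack([td_a, td_b], dim=0)  # LazyStackedTensorDict, keeps the heterogeneity
print(stacked.batch_size)    # torch.Size([2])
print(stacked["obs"].shape)  # torch.Size([2, 4]) -- common entries stack as usual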
On your side let me know how I can help!
@btx0424 That is also my goal! I especially like that you mentioned heterogeneous agents, which is particularly important for me.
Here is a list of issues @vmoens and I have focused on to pave the way to multi-agent (#777). Some of them are still open, such as multi-agent collectors.
Happy to discuss the next steps and curious about any suggestions.
I can see that you have made impressive efforts to keep everything in a (nested-) tensor-like structure, and tensor_spec is currently the next step.
After reading through #777, the things that come into my mind are:
- We should allow per-agent done in addition to the environment-level done: There are cases where some agents die before others, e.g., a unit gets destroyed in SMAC, and the later invalid transitions should not be used for training.
- The batch size [n_vec_envs, n_agents] may have limitations. It is common in MARL to provide some privileged information (e.g., a global state) to a centralized critic. Generally, the shapes are still only homogeneous up to [n_vec_envs, *].

As I understand, here we are faced with whether to provide an AgentSpec abstraction.

- It is more natural for each agent to define its specs without the prepended [*, n_agents] dimensions. It is easier to expand the agent-level spec to the Env-level spec than the other way around.
- With TensorDictModule, a natural way to think of agent-policy identification is to name the agents and set in_keys and out_keys correspondingly. This could be easily achieved with a Transform from the stacking-and-indexing way (and vice versa), though.
- If we provide an AgentSpec or similar abstraction at the environment level, then it makes less sense to have env.n_agents and hence the batch size [n_vec_envs, n_agents].

A solution would be providing an AgentSpec abstraction with which individual specs are specified, and later expanding and stacking them in EnvBase.__setattr__ for the user upon self.some_agent = AgentSpec(...), ensuring everything looks correct.
Nonetheless, I'm definitely a fan of the stacking-and-indexing way now. In MARL we sometimes want a policy pool or something like the league training in AlphaStar. Essentially it is executing policies with the same structure but different params or states. Imagine we can:

- assign SubEnvs to different SubCollectors;
- have SubCollectors share the same policy structure, but just using different params;
- stack SubCollectors and do parallel inference with functorch.vmap instead of a for loop, to perform operations that could be parallelized.

We should allow per-agent done in addition to the environment-level done: There are cases where some agents die before others, e.g., a unit gets destroyed in SMAC, and the later invalid transitions should not be used for training.
Does this imply something different than a done state with a shape that matches the env shape?
It is common in MARL to provide some privileged information (e.g., a global state) to a centralized critic.
That's an interesting point. IIUC you're saying that an env with a batch size [B] should be able to send and receive information that has an empty shape []?
Can we get around this by expanding the common information? torch.expand does not allocate memory so that should not be an issue. But I may be missing the point here.
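For what it's worth, here is a quick sanity check (made-up sizes) that expand gives a broadcast view rather than a copy:

import torch

E, n_agents, D = 4, 3, 16
state = torch.randn(E, D)                              # common/global information
per_agent = state.unsqueeze(1).expand(E, n_agents, D)  # broadcast view, no extra memory
assert per_agent.data_ptr() == state.data_ptr()        # same underlying storage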
provide an AgentSpec abstraction
I'm not entirely sure of what this would look like.
Here's my take on envs and specs: beyond field-specific commitments, I think that it is natural to think about populations of envs as something that we can group with torch.stack. #892 will provide this functionality for specs. Eventually, I would like to be able to stack and index specs in a way that feels natural. That could also allow us to build training loops where agents are dynamically added or removed (we're not there yet but it would be cool).
I'm not sure I see what an AgentSpec would be? Can you elaborate on what it would contain? Would it be strictly a MARL thing?
Agent is a bit overused and I'm always confused about what it means (sometimes ppl talk about an agent as an algorithm, sometimes as an actor, sometimes as something completely different), hence I try to avoid it in torchrl. If we have a crystal clear definition of what it is and what it isn't I'm happy to adopt it ;)
Also, if not, the users themselves have to ensure the ordering of the stacking and later do correct indexing. Overall this becomes confusing for inexperienced users.
This makes me curious. Why would it be confusing? What is the alternative? I know sometimes agents are sort of "named" in a dictionary but that seems hardly scalable and reusable to me. Also, it is a strong commitment to MARL: even though I'm super excited about supporting MARL I would like that someone doing something completely different (say, offline RL or bandits) is not faced with things that originate from other domains. Again, I'm not saying we don't want our API to be challenged, on the contrary, but I'm wondering what the alternative would be. I tend to think that the concept of indexable envs can group together MARL, multi-task, or even just distributed RL.
Anyhow: I'm super excited to see how productive these conversations are!
cc @PaLeroy FYI
@btx0424 thanks for taking the time to read through everything. You made many interesting points, let me start addressing a few.
I also really like the point of being able to assign different policies to different sub-envs , sub-tasks or agents. So this would mean passing to a collector a bunch of policies each with a slice of the env.batch_size it should operate on.
This dilemma of having the agents in the batch_size or not has been following me for a while, and I am curious to get everyone's thoughts early on to see if it is worth going this way or just having the agents as a dimension outside the batch_size. Here are the pros and cons I can think of:
I also really like the point of being able to assign different policies to different sub-envs , sub-tasks or agents. So this would mean passing to a collector a bunch of policies each with a slice of the env.batch_size it should operate on.
This is supported by TensorDictSequential provided that agents have observation keys that differ: the module will route the observations to the appropriate policy as needed. It also supports parameter tying and vmap.
Replying to @vmoens and #897.
Are you saying that, in your opinion, the first thing someone should be taught about is how to build an environment from scratch?
I wouldn't say that. I think the agent shouldn't have any contact with the env object/variable, and the env shouldn't have to be derived from a base class. Like in the following code from the tutorial:
def make_ddpg_actor(
    stats,
    device="cpu",
):
    proof_environment = make_transformed_env(make_env(), stats)
    env_specs = proof_environment.specs
    out_features = env_specs["action_spec"].shape[0]
    ...
    distribution_kwargs={
        "min": env_specs["action_spec"].space.minimum,
        "max": env_specs["action_spec"].space.maximum,
    },
I think the agent should not get the entire env object but just the bare minimum of information it needs to construct the networks: the state and action shape and maybe the action min/max (but maybe not even that; it could just output [-1, 1] and then it would be up to the user to put that into the correct range). This would eliminate the need for the environment to be changed to fit a certain interface and would put all the responsibility on the programmer, which I find more intuitive.
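To make that concrete, here is a rough sketch of the kind of interface I have in mind (purely hypothetical names and sizes):

import torch.nn as nn

def make_actor(n_obs: int, n_act: int):
    # the "bare minimum": only shapes, no env object
    # the network outputs in [-1, 1]; rescaling to the real action range is up to the user
    return nn.Sequential(nn.Linear(n_obs, 64), nn.Tanh(), nn.Linear(64, n_act), nn.Tanh())

actor = make_actor(n_obs=24, n_act=6)  # made-up sizes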
I would guess that many newcomers would like to get their hands on a more "concrete" and pre-packed problem like solving an existing task (ie a gym environment or similar).
Hmm. I actually don't know. You might be right. Personally, I feel like I've always had an environment I wanted to work on and found the Gym envs annoying to install. I think I'd appreciate more an example that assumes an environment has been loaded already and just applies the agent to it. I know when I was reading torch tutorials that I also never liked loading the built-in datasets like MNIST but liked creating my own data. But to each his own and I might just be weird.
We can definitely do a better job at documenting the individual features of the library. It's sort of a Rubik's cube, where you solve one face but it messes up the face on the other side:
- If we do tutorials/examples about single features, it is confusing for newcomers, as they need to go through multiple tutorials to understand how to code a simple "go from A to B" policy.
- If we go for the full tutorial, there is quickly too much info to be digested in one go.
I totally get what you're saying. I think little examples directly in the docs page of the object would be the best. Just like how to instantiate the object and how to call it or something. For example like this docs page for MSELoss.
I hope that helped. And thanks for your time, I really appreciate it.
I think the agent should not get the entire env object but just the bare minimum of information it needs to construct the networks - the state and action shape and maybe the action min/max(but maybe not even that, it could just output [-1, 1] and then it would be up to the user to put that into the correct range). This would eliminate the need for the environment to be changed to fit a certain interface and would put all the responsibility on the programmer which I find more intuitive.
Right, but don't read too much into that helper function. This is a simple code snippet aimed at creating an env in a specific context, i.e. it is not part of the library as a core component. Again: we could do a better job at explaining how to use the primitives, but helper functions should not be seen as indications of how to build a training pipeline. They are merely there to tie the pieces together.
Hmm. I actually don't know. You might be right. Personally, I feel like I've always had an environment I wanted to work on and found the Gym envs annoying to install. I think I'd appreciate more an example that assumes an environment has been loaded already and just applies the agent to it. I know when I was reading torch tutorials that I also never liked loading the built-in datasets like MNIST but liked creating my own data. But to each his own and I might just be weird.
Good to know. I'm working on a "make your own env" tutorial, without the need to install any other library. I'll ping you when it's done
The picture is getting clearer. I think it's possible to make a unifying API with some effort.
Let's address the problems one by one. Throughout the following discussion, we assume we are dealing with a vectorized multi-agent environment with E envs, two types of agents with A1 and A2 agents respectively, and an observation dimension D.
Regarding the batch_size of an Env, I meant something like:
import torch
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec
E, (A1, A2), D = 1024, (2, 3), 16
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2))) # should go to a centralized critic
], dim=0).expand(E, A1+A2+1)
obs_spec.shape # torch.Size([1024, 6])
Another option is to have a separate Env.state_spec. But either way, we are not getting TensorDicts of shape [E, A1+A2]. Here state is not common information for the actors but privileged information for the critic, so torch.expand does not intuitively solve the issue.
But now I think this is not really a problem and I'll try to show why later. Let's drop the state and proceed. With the above spec, assuming A1 and A2 have the same actor model structure, a wonderful way to construct the joint policy is to stack individual policies. For simplicity, we use plain nn.Module instead of TensorDictModule.
import torch
import torch.nn as nn
import functorch
from torchrl.data.tensor_specs import UnboundedContinuousTensorSpec
from tensordict import TensorDict
from tensordict.nn import make_functional
from typing import Tuple
from copy import deepcopy
E, (A1, A2), D = 1024, (2, 3), 16
input_spec = torch.stack([
*[UnboundedContinuousTensorSpec(D) for _ in range(A1)],
*[UnboundedContinuousTensorSpec(D) for _ in range(A2)],
], dim=0)
def stack(modules: Tuple[nn.Module, ...]):
    if not len(set(m.__class__ for m in modules)) == 1:
        raise ValueError("Currently only stacking homogeneous modules is supported.")
    params = torch.stack([make_functional(deepcopy(m)) for m in modules])
    m = deepcopy(modules[0])
    make_functional(m)
    func = functorch.vmap(m)  # a lot to do here to ensure the desired behavior
    # we might want to assign a `shape` to the "stacked" func
    return func, params
x = input_spec.rand()
a1 = nn.Linear(D, 1)
a2 = nn.Linear(D, 1)
joint_policy, params = stack([
*[a1 for _ in range(A1)],
*[a2 for _ in range(A2)]
])
y1 = joint_policy(x, params)
y2 = torch.cat([
a1(x[:A1]),
a2(x[A1:])
])
assert torch.allclose(y1, y2, atol=1e-7) # the results are slightly different, though
Ideally, this parallel inference could provide a considerable speedup. It also saves the trouble of having to index the input to do policy-agent matching.
I also really like the point of being able to assign different policies to different sub-envs , sub-tasks or agents. So this would mean passing to a collector a bunch of policies each with a slice of the env.batch_size it should operate on.
This is supported by TensorDictSequential provided that agents have observation keys that differ: the module will route the observations to the appropriate policy as needed. It also supports parameter tying and vmap.
TensorDictSequential is of course capable of the job if we index-and-name the input, but the operations are sequential. I think the above shares the same spirit with, and is the first step toward, stacking sub-collectors.
As for the AgentSpec abstraction, I have been doing something similar to:
from torchrl.envs import EnvBase as _EnvBase
from torchrl.data import TensorSpec
from dataclasses import dataclass
from typing import List

@dataclass
class AgentSpec:
    name: str
    n_agents: int
    observation_spec: TensorSpec
    action_spec: TensorSpec
    reward_spec: TensorSpec

class EnvBase(_EnvBase):
    agent_specs: List[AgentSpec] = []

    def register_agent(self, agent_spec: AgentSpec):
        # completely optional
        # 1. the named-agents way
        self.agent_specs.append(agent_spec)

        def expand(spec: TensorSpec):
            return spec.expand(*self.batch_size, agent_spec.n_agents, *spec.shape)

        self.observation_spec[f"{agent_spec.name}.obs"] = expand(agent_spec.observation_spec)
        self.action_spec[f"{agent_spec.name}.action"] = expand(agent_spec.action_spec)
        self.reward_spec[f"{agent_spec.name}.reward"] = expand(agent_spec.reward_spec)

        # 2. the StackedSpec way
        # not sure what it exactly looks like for now, but I believe it is doable and makes good sense

# so that we can do
env = EnvBase(...)
policies = []
for agent_spec in env.agent_specs:
    policies.extend([make_model(agent_spec) for _ in range(agent_spec.n_agents)])  # the more natural individual view
joint_policy = torch.stack(policies)
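For instance, registering agents under this (hypothetical) API could look like the following, reusing UnboundedContinuousTensorSpec from the earlier snippets; the names and sizes are made up:

# purely illustrative use of the hypothetical AgentSpec/register_agent above
env.register_agent(AgentSpec(
    name="predator",
    n_agents=2,
    observation_spec=UnboundedContinuousTensorSpec(16),
    action_spec=UnboundedContinuousTensorSpec(4),
    reward_spec=UnboundedContinuousTensorSpec(1),
))
env.register_agent(AgentSpec(
    name="prey",
    n_agents=3,
    observation_spec=UnboundedContinuousTensorSpec(16),
    action_spec=UnboundedContinuousTensorSpec(4),
    reward_spec=UnboundedContinuousTensorSpec(1),
))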
Putting things together, I think you can see my point about how it makes policy construction more straightforward. By design, we can make this completely optional in case the user wants to specify everything manually. So it won't trouble people outside the MARL context.
But now I think this is not really a problem and I'll try to show why later.
Up to now, we have been assuming a homogeneous setting. However, once we can generalize to heterogeneous cases where we can correctly stack nn.Modules with different structures, it is as straightforward as:
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2))) # should go to a centralized critic
], dim=0).expand(E, A1+A2+1)
actor_critic = stack([
*[make_actor(...) for _ in range(A1)],
*[make_actor(...) for _ in range(A2)],
make_critic(...)
])
I have really been getting inspiration from this project and the discussion. It would be great to get started by making an example/tutorial with VMAS once #892 is ready and see what problems might pop up. What do you think?
(btw it looks like cat may be more convenient than stack in the above examples haha)
Regarding the batch_size of an Env, I meant something like

import torch
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec
E, (A1, A2), D = 1024, (2, 3), 16
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2))) # should go to a centralized critic
], dim=0).expand(E, A1+A2+1)
obs_spec.shape # torch.Size([1024, 6])

Another option is to have a separate Env.state_spec. But either way, we are not getting TensorDicts of shape [E, A1+A2]. Here state is not common information for the actors but privileged information for the critic, so torch.expand does not intuitively solve the issue.
I think you might have lost me here already. Where is D gone? How does that expand shape fit? Are you proposing a batch dimension of n_agents + 1?
Sorry for the delayed reply. The example was coded with the implementation of #892 at the time. I went back and found that currently, the behavior of stacking CompositeSpec is different from that of non-composite ones:
import torch
from torchrl.data import CompositeSpec, UnboundedContinuousTensorSpec
E, (A1, A2), D = 1024, (2, 3), 16
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D)) for _ in range(A2)],
# CompositeSpec(state=UnboundedContinuousTensorSpec(D * (A1+A2)))
], dim=0)
print(obs_spec.shape) # [A1+A2], so obs_spec.expand(E, A1+A2) works
input_spec = torch.stack([
*[UnboundedContinuousTensorSpec(D) for _ in range(A1)],
*[UnboundedContinuousTensorSpec(D) for _ in range(A2)],
], dim=0)
print(input_spec.shape) # [A1+A2, D], have to do input_spec.expand(E, A1+A2, D)
So D is indeed gone in the composite case. I'm unsure which makes more sense here, but I would vote for the first.
Are you proposing a batch dimension of n_agents + 1?
No. It is just an example of what commonly happens when we want to provide some additional input that is not of shape [n_agents, *]. This information may or may not be for the actors. So expanding it to [n_agents, *] does not solve the issue in a general way.
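As a toy illustration of the kind of data I mean (made-up keys and sizes), with the env batch size kept at [n_vec_envs]:

import torch
from tensordict import TensorDict

E, n_agents, D = 8, 3, 16
td = TensorDict({
    "agents_obs": torch.randn(E, n_agents, D),  # per-agent observations
    "state": torch.randn(E, n_agents * D),      # privileged info for a centralized critic, not per-agent
}, batch_size=[E])
print(td.batch_size)  # torch.Size([8]) -- only the env dimension is treated as batch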
So I am proposing:
- Keep the Env's batch size as [n_vec_envs] for flexibility. In this case, Collectors are counting environment steps. "Solving a multi-agent game in X environment steps" is a fair way to talk about sample efficiency in the literature and makes enough sense to me.
- Set the obs_spec's shape to the result of stacking. Think of the joint policy for rollout as stacked Modules that are consistent with the specs in some way. Apply ensembling to facilitate parallel inference if possible.

Regarding stacking Modules we will need more opinions and help from @vmoens. Assuming agents of the same type share the same actor network, the example I used is memory-inefficient as it makes copies of the same parameters. Also, we will have to manually collect the gradients when training.
Ideally it should look like:
a1 = nn.Linear(D, 1)
a2 = nn.Linear(D, 1)
policy = torch.cat([
a1.expand(A1),
a2.expand(A2),
]) # returns a lazily concat-ed object so that we don't actually copy those parameters
policy.shape # [A1+A2]
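For reference, a pattern that already works today and goes part of the way there is functorch's ensembling utility, although it still materializes one copy of the parameters per member (sketch with made-up sizes):

import torch
import torch.nn as nn
import functorch

D, A = 16, 3
models = [nn.Linear(D, 1) for _ in range(A)]
fmodel, params, buffers = functorch.combine_state_for_ensemble(models)
x = torch.randn(A, D)  # one input row per ensemble member
y = functorch.vmap(fmodel)(params, buffers, x)
print(y.shape)  # torch.Size([3, 1])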
So D is indeed gone in the composite case. I'm unsure which makes more sense here, but I would vote for the first.
Bouncing back on this: it is only gone because you did not assign a shape to CompositeSpec. If you do
obs_spec = torch.stack([
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D), shape=[D]) for _ in range(A1)],
*[CompositeSpec(obs=UnboundedContinuousTensorSpec(D), shape=[D]) for _ in range(A2)],
], dim=0)
the behaviour is consistent.
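i.e., if I'm not mistaken, the stacked spec should then report the feature dimension as well:

print(obs_spec.shape)  # expected: torch.Size([A1 + A2, D])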
For stacking layers, the way we approach this is the following:
import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictModule, TensorDictSequential

# a1, a2 and D as defined in the snippets above
module1 = TensorDictModule(a1, in_keys=["key1"], out_keys=["embed"])
module2 = TensorDictModule(a2, in_keys=["key2"], out_keys=["embed"])
module = TensorDictSequential(module1, module2, partial_tolerant=True)
data = torch.stack([
TensorDict({"key1": torch.randn(D)}, []),
TensorDict({"key2": torch.randn(D)}, [])
], 0)
module(data)
When TensorDictSequential does not find "key1" in your lazy stacked TensorDict, it will decompose the stack and look for "key1" in each of the stacked tensordicts.
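So, after running the snippet above, both sub-tensordicts should carry the "embed" entry and it can be gathered from the stack (assuming both branches produce the same output shape):

out = module(data)
print(out["embed"].shape)  # expected: torch.Size([2, 1])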
In multi-agent settings, you will just need to set a different set of key names for each agent, collect them in a lazy stacked td and pass that to a TensorDictSequential.
That way we can avoid having to code stacks of Linear layers, convolutional layers, etc.
Wdyt?
Sorry for the delayed reply. I think that works for me.
I know sometimes agents are sort of "named" in a dictionary but that seems hardly scalable and reusable to me.
In multi-agent settings, you will just need to set a different set of key names for each agent, collect them in a lazy stacked td and pass that to a TensorDictSequential.
Guess in the end we have to give agents names somewhere. But you've managed to make things as easy as possible. That's great!
I will digest this a bit and try to work out a multi-agent PPO example/tutorial that looks similar to the PPO tutorial and demonstrates the ideas we have discussed above.