ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Custom model for multi-agent environment: access to all states #7341

Closed: janblumenkamp closed this issue 1 year ago

janblumenkamp commented 4 years ago

What is your question?

My goal is to learn a single policy that is deployed to multiple agents (i.e. all agents learn the same policy, but are able to communicate with each other through a shared neural network). RLlib's multi-agent interface works with a dict that specifies an action for each individual agent.
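Just to make explicit which interface I mean, this is roughly the dict-based contract I am referring to (an illustrative sketch only; the env, spaces, and agent IDs are made up):

# Illustrative sketch of the dict-based multi-agent interface (agent IDs,
# spaces, and values are made up; old gym-style reset/step API).
import gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class TwoAgentEnv(MultiAgentEnv):
    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (4,), np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self):
        # One observation per agent ID.
        return {"agent_0": np.zeros(4, np.float32),
                "agent_1": np.zeros(4, np.float32)}

    def step(self, action_dict):
        # action_dict maps each agent ID to that agent's action.
        obs = {aid: np.zeros(4, np.float32) for aid in action_dict}
        rewards = {aid: 0.0 for aid in action_dict}
        dones = {aid: False for aid in action_dict}
        dones["__all__"] = False
        return obs, rewards, dones, {aid: {} for aid in action_dict}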

It is not entirely clear to me how my custom model is supposed to obtain the current state after the last time step for all agents at once (it appears that RLlib calls the forward function of my TorchModelV2 subclass for each agent individually and passes each agent's state into the state argument of forward).

tl;dr, if this is my custom model:

import torch.nn as nn

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override


class AdaptedVisionNetwork(TorchModelV2, nn.Module):
    """Generic vision network."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        # ... NN model definition (conv/FC layers, self.predict, self._logits, self._value_branch, ...)
        self._cur_value = None

    @override(TorchModelV2)
    def forward(self, input_dict, state, seq_lens):
        # input_dict["obs"] is a batch of (single-agent) observations.
        features = self.predict(input_dict["obs"].float())
        logits = self._logits(features)
        self._cur_value = self._value_branch(features).squeeze(1)
        return logits, state

    @override(TorchModelV2)
    def value_function(self):
        assert self._cur_value is not None, "must call forward() first"
        return self._cur_value
Then how do I predict the logits for all of my n agents at once while having access to the current state of all my agents? Am I supposed to use variable sharing? Is #4748 describing this exact problem? If so, is there any progress?

janblumenkamp commented 3 years ago

To add to this, as another working example: this is the project/repository that came out of this thread for me.

As a working minimal example with a more recent Ray version, I have created this repository. It's a toy problem that serves as a reference implementation for the changes that are due to be done in RLlib. I talked to Sven recently and the plan is to hopefully get this done over the next few weeks :)

EDIT: Just an update regarding my minimal example: it now supports both continuous and discrete action spaces, and I have cleaned up the trainer implementation quite a bit, so it should be much clearer now. Let me know if you have any questions.

Rohanjames1997 commented 3 years ago

Hi @ericl @janblumenkamp. This whole thread was very helpful, thanks for the detailed explanations from both of you!

I am currently in the process of migrating a project to the RLlib framework, and I have some doubts about a few points in your discussion. Here's some context before I begin:

My doubts revolve around the agent grouping mechanism, and in particular this exchange:

> or is the grouped super-agent literally treated as one big agent with a huge observation and action space

> It's the latter, it really is one big super-agent. You could potentially still do an architectural decomposition within the super agent model though (i.e., to emulate certain multi-agent architectures).
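For concreteness, my rough understanding of how such a grouping is wired up is the following, borrowing the pattern from the QMIX two-step example (MyMultiAgentEnv, the agent IDs, and the group name are placeholders):

# Rough sketch of grouping two agents into one super-agent (pattern borrowed
# from the QMIX two-step example; MyMultiAgentEnv and the agent IDs are placeholders).
from gym.spaces import Tuple
from ray.tune.registry import register_env

def make_grouped_env(config):
    env = MyMultiAgentEnv(config)  # placeholder MultiAgentEnv with agent IDs 0 and 1
    grouping = {"group_1": [0, 1]}
    # The grouped env exposes one Tuple observation/action space per group,
    # i.e. one big agent whose obs/actions stack those of its members.
    obs_space = Tuple([env.observation_space, env.observation_space])
    act_space = Tuple([env.action_space, env.action_space])
    return env.with_agent_groups(grouping, obs_space=obs_space, act_space=act_space)

register_env("grouped_env", make_grouped_env)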

  1. If I have a single policy for all my agents, is it necessary to provide a grouping mechanism? I'm guessing it isn't required. (See the config sketch right after this list for what I have in mind.)
     1.1. In this case, would I be allowed to have a varying number of agents?
  2. For this homogeneous policy case, what exactly is the benefit of subclassing the MultiAgentEnv class? Is it somehow related to the part of the blog where it says this? 👇
     > First, decomposing the actions and observations of a single monolithic agent into multiple simpler agents not only reduces the dimensionality of agent inputs and outputs, but also effectively increases the amount of training data generated per step of the environment.
  3. If I had more than one policy, would the number of agents per policy be constrained by the super-agent mechanism of concatenating agents? Are there cases where the number of agents per policy can be dynamic?
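Regarding question 1, what I currently have in mind for a single shared policy is roughly this (a sketch only; the spaces, the policy name, and the env name are placeholders, and I'm not sure this is the intended way):

# Sketch of a single shared policy for all agents (names/spaces are placeholders;
# config layout as I understand the Ray 1.x multi-agent API).
import gym

obs_space = gym.spaces.Box(-1.0, 1.0, (4,))
act_space = gym.spaces.Discrete(2)

config = {
    "env": "my_multi_agent_env",  # placeholder registered env name
    "multiagent": {
        # A single policy entry; None means "use the trainer's default policy class".
        "policies": {"shared_policy": (None, obs_space, act_space, {})},
        # Every agent ID maps to the same policy, so all agents share weights.
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}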

Thank you so much again! I can't wait to onboard to RLlib!

janblumenkamp commented 3 years ago

Hi @Rohanjames1997! Have a look at the discussion further down in this thread. You can't use MultiAgentEnv, and grouping also does not help if you want to run backpropagation through the communication. Check out my minimal example: https://github.com/janblumenkamp/rllib_multi_agent_demo. It involves many ugly hacks, most notably formulating the multi-agent env as one standard Gym super-observation and super-action space that contains the observations and actions for a fixed number of agents (in your case, maybe you can just mask out the agents you don't need), and passing rewards for each agent through the info dict to the trainer. I will update it to Ray 1.3.0 soon!
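Roughly, the env side of that workaround looks something like this (a simplified sketch, not the exact code from the repo; the number of agents, the per-agent spaces, and the info key are placeholders):

# Simplified sketch of the "super-agent" workaround described above
# (not the exact code from the linked repo; names and spaces are placeholders).
import gym
import numpy as np

class SuperAgentEnv(gym.Env):
    def __init__(self, n_agents=4):
        self.n_agents = n_agents
        # One fixed-size block per agent, stacked into a single observation.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (n_agents, 8), np.float32)
        # One discrete action per agent, flattened into one MultiDiscrete action.
        self.action_space = gym.spaces.MultiDiscrete([5] * n_agents)

    def reset(self):
        return np.zeros(self.observation_space.shape, np.float32)

    def step(self, action):
        # `action` holds all agents' actions at once, so the model's forward()
        # sees every agent's observation in one batch row and can run
        # (differentiable) communication across agents internally.
        obs = np.zeros(self.observation_space.shape, np.float32)
        per_agent_rewards = [0.0] * self.n_agents
        # The env returns a single scalar reward (e.g. the sum); the per-agent
        # breakdown is passed to the trainer through the info dict.
        return obs, float(sum(per_agent_rewards)), False, {"rewards": per_agent_rewards}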

Rohanjames1997 commented 3 years ago

Hi @janblumenkamp ! Thank you so much for the link! I had missed that part of the discussion. I shall probably implement something very similar.

Assuming I had no inter-agent communication, could you answer my previous questions?

And an additional one: since #10884 is still in progress, is it right to say that RLlib's MultiAgentEnv class currently does not support graph neural networks (since GNNs involve communication between agents by default)?

Thanks again! And congratulations on the paper! It was a great read! 😄