ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

How should state and action spaces, states, actions, rewards, and dones be defined in a multi-agent environment? #6875

Closed oroojlooy closed 4 years ago

oroojlooy commented 4 years ago

Ray version and other system information (Python version, TensorFlow version, OS): I have Debian 8.7, Python 3.7.4, TensorFlow 2.0.1.

What is your question?

I have a custom multi-agent environment that uses gym spaces to define the observation_space and a discrete action_space, and that implements reset and step functions. The environment can have any number of agents greater than one. I want to use this environment with Ray and run the various multi-agent algorithms implemented there. When I tried my current environment with Ray, it did not work and produced a strange error about not being able to read the state. I realized that I do not actually know how to define these things in my environment so that they match what Ray expects. So I have a few questions:

  1. How should the action_space be defined? For example, it could be a list of gym.spaces.Discrete(.) objects like [gym.spaces.Discrete(.), gym.spaces.Discrete(.), ...], a dictionary of those spaces like {0: gym.spaces.Discrete(.), 1: gym.spaces.Discrete(.), ...}, or a single gym.spaces.MultiDiscrete(.). Which one does Ray prefer?
  2. The same question applies to the observation_space.
  3. When we call env.reset(), it returns the state, which could be a list or a dictionary. This state has to be provided to Ray to run an RL algorithm. Again, how should it be structured? A list or a dictionary? e.g. [s^0_0, s^0_1, s^0_2, ....] or {0: s^0_0, 1: s^0_1, 2: s^0_2, ....}, or something else?
  4. The same question applies to the actions, rewards, and dones.

Please let me know if I need to add more details for any of these questions.

Thanks in advance, Afshin

rusu24edward commented 4 years ago

Here are a few resources that have helped me better understand the MultiAgentEnv:

https://ray.readthedocs.io/en/latest/rllib-env.html
https://github.com/ray-project/ray/blob/master/rllib/examples/rock_paper_scissors_multiagent.py#L7
https://github.com/ray-project/ray/blob/master/rllib/examples/multiagent_two_trainers.py
https://github.com/ray-project/ray/blob/master/rllib/examples/multiagent_cartpole.py
https://github.com/ray-project/ray/blob/master/rllib/env/multi_agent_env.py

rusu24edward commented 4 years ago

In particular, looking at the MultiAgentEnv interface documentation, we should notice a few similarities and differences between MultiAgentEnv and gym.Env.

Similarities:

  1. We must implement the reset and step functions.

Differences:

  1. We do not have to define an observation_space or an action_space in the environment; those attributes do not actually exist in the MultiAgentEnv interface. Our urge to define them in the environment is a holdover from using gym.Env. In my opinion, this example is a little misleading because it can lead us to think that a MultiAgentEnv has to have a single observation_space and a single action_space defined. But this is not true, as we see here: the traffic lights and the cars have different action and observation spaces, yet they interact in the same MultiAgentTrafficEnv. So where do the action and observation spaces get defined for each agent? The answer is: in the policy. When you define the policy and map the agent_id to that policy, you are defining the action and observation space for that agent.
  2. As you hint at in your question, everything becomes a dict. Actions, observations, rewards, dones, and infos are all dictionaries keyed by the agent's id (see the sketch below). The actions and observations map each agent's id to a value from that agent's gym space. The rewards map each agent's id to a scalar. The dones map each agent's id to a boolean, with the addition of the "__all__" key that needs to be set in step (see here). The infos map each agent's id to a dictionary containing info for that specific agent.

So it seems to me that we need to shed the habit of thinking of multi-agent environments as extensions of gym environments, because they are quite different.
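
To make that dict contract concrete, here is a bare-bones sketch (my own, not from the RLlib examples; the agent ids, spaces, and episode length are made up) of what reset and step might return:

import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class TwoAgentSketchEnv(MultiAgentEnv):
    """Illustrative only: two agents, 5-dim observations, 10-step episodes."""

    def __init__(self, env_config=None):
        self.agent_ids = ["agent_0", "agent_1"]
        self.t = 0

    def reset(self):
        self.t = 0
        # Observations: one entry per agent id.
        return {aid: np.zeros(5, dtype=np.float32) for aid in self.agent_ids}

    def step(self, action_dict):
        self.t += 1
        episode_over = self.t >= 10
        obs = {aid: np.zeros(5, dtype=np.float32) for aid in action_dict}
        rewards = {aid: 0.0 for aid in action_dict}          # scalar per agent
        dones = {aid: episode_over for aid in action_dict}   # boolean per agent
        dones["__all__"] = episode_over                      # episode-level flag RLlib requires
        infos = {aid: {} for aid in action_dict}             # per-agent info dict
        return obs, rewards, dones, infos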

oroojlooy commented 4 years ago

Thanks @rusu24edward for the explanation. I agree with you that the MultiCartPole example is misleading; we need to define agents that inherit from gym, while the multi-agent env itself only needs to inherit from MultiAgentEnv. As I explained, I have defined the state, action, reward, and done as dictionaries, and inside each state the per-agent observation is stored as a list. My main problem is how to define the observation_space and action_space. I do not want to define them when I define the policy, since the environment is going to be used by other packages too.

Right now, the state returned by the environment looks like { 0: [0.0, 0.0, 0.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0, 0.0, 0.0], 2: [0.0, 0.0, 0.0, 0.0, 0.0], 3: [0.0, 0.0, 0.0, 0.0, 0.0]}. When I define a dictionary of gym spaces for the observation_space, e.g. {0: Box(.), 1: Box(.), ...}, I get this error:

Traceback (most recent call last):
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 351, in fetch_result
    result = ray.get(trial_future[0])
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray_PPO:train() (pid=22333, host=polyp30)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 90, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 372, in __init__
    Trainable.__init__(self, config, logger_creator)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 96, in __init__
    self._setup(copy.deepcopy(self.config))
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 492, in _setup
    self._init(self.config, self.env_creator)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 109, in _init
    self.config["num_workers"])
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 537, in _make_workers
    logdir=self.logdir)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 64, in __init__
    RolloutWorker, env_creator, policy, 0, self._local_config)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 220, in _make_worker
    _fake_sampler=config.get("_fake_sampler", False))
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 348, in __init__
    self._build_policy_map(policy_dict, policy_config)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 741, in _build_policy_map
    obs_space, merged_conf.get("model"))
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/models/catalog.py", line 367, in get_preprocessor_for_space
    cls = get_preprocessor(observation_space)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/models/preprocessors.py", line 254, in get_preprocessor
    legacy_patch_shapes(space)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/models/preprocessors.py", line 290, in legacy_patch_shapes
    return space.shape
AttributeError: 'dict' object has no attribute 'shape'

Similarly, when I define a single space for all agents, e.g. Box(shape=(number_of_agents, stateDim)), I get ValueError: ('Observation outside expected value range', Box(4, 5), array([0., 0., 0., 0., 0.])):

Traceback (most recent call last):
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 351, in fetch_result
    result = ray.get(trial_future[0])
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray_PPO:train() (pid=9308, host=polyp30)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 418, in train
    raise e
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 407, in train
    result = Trainable.train(self)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 140, in step
    self.num_envs_per_worker, self.train_batch_size)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/optimizers/rollout.py", line 29, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
ray.exceptions.RayTaskError(ValueError): ray_RolloutWorker:sample() (pid=9292, host=polyp30)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 469, in sample
    batches = [self.input_reader.next()]
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 56, in next
    batches = [self.get_data()]
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 99, in get_data
    item = next(self.rollout_provider)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 319, in _env_runner
    soft_horizon, no_done_at_end)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 407, in _process_observations
    policy_id).transform(raw_obs)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/models/preprocessors.py", line 166, in transform
    self.check_shape(observation)
  File "/scratch/afo214/anaconda3/lib/python3.7/site-packages/ray/rllib/models/preprocessors.py", line 65, in check_shape
    self._obs_space, observation)
ValueError: ('Observation outside expected value range', Box(4, 5), array([0., 0., 0., 0., 0.]))

Any idea how to fix this issue?

rusu24edward commented 4 years ago

Regarding the first error: I believe the observation_space and action_space for each policy must be a gym.spaces object. It looks like you are attempting to pass a dictionary mapping ids to gym.spaces objects as the observation_space for a single policy. This won't work; each policy must get a single gym.spaces object. You can create a single policy like this and then map a bunch of agents to that policy. For example, the traffic light env shows that all traffic light agents map to the same policy and all cars map to a random selection of car policies, but the policies themselves are each defined with a gym.spaces object for the observation_space and the action_space.

Regarding the second error, I can think of two things. First, if you really are doing Box(shape=(4, 5)), then this should produce an error because gym's Box requires the low and high arguments. Second, I tend to flatten my observation/action spaces into single-dimensional entities. This is something I picked up from using stable-baselines, which required it. I'm not sure whether RLlib requires this, but it is worth looking into.
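
For illustration, a minimal sketch of what that flattening might look like (my own example, not from RLlib; the bounds and sizes are arbitrary):

import numpy as np
from gym.spaces import Box

n_agents, state_dim = 4, 5

# A 2-D space covering all agents at once...
obs_space_2d = Box(low=0.0, high=10.0, shape=(n_agents, state_dim), dtype=np.float32)

# ...versus the same information flattened into a 1-D space.
obs_space_flat = Box(low=0.0, high=10.0, shape=(n_agents * state_dim,), dtype=np.float32)

# In the environment, flatten the raw observation before returning it.
raw_obs = np.zeros((n_agents, state_dim), dtype=np.float32)
flat_obs = raw_obs.ravel()
assert obs_space_flat.contains(flat_obs)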

I'm not sure what you mean by agents needing to inherit from gym. The gym interface is for environments, and typically we think of agents as algorithms that learn a policy by interacting with the environments. The algorithms don't need to inherit from gym; they just expect the gym interface and the underlying data structures (gym.spaces).

Hope this helps!

oroojlooy commented 4 years ago

@rusu24edward As I mentioned, I prefer to include the observation_space and action_space as properties of the environment, since I want to use the env with packages other than Ray too. So passing the observation_space and action_space the way the traffic light env example does will not work for me. You mean there is no way of having what I want, right?

Besides, in the second approach I do supply the rest of the required arguments to define a Box, e.g. spaces.Box(low=0, high=10, shape=(config.NoAgent, config.stateDim), dtype=np.float32). So the problem is probably that it expects each observation it checks to match that single space, while I am passing a dictionary of states for all agents.
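
For example, the mismatch can be reproduced outside of Ray with a quick check (a rough sketch; the bounds and sizes just mirror the numbers in the traceback):

import numpy as np
from gym.spaces import Box

# Space defined jointly for all 4 agents...
joint_space = Box(low=0, high=10, shape=(4, 5), dtype=np.float32)

# ...but each agent's observation is checked on its own.
single_agent_obs = np.zeros(5, dtype=np.float32)
print(joint_space.contains(single_agent_obs))  # False, matching the "outside expected value range" error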

About the inheritance: I did not mean the RL agent that learns the policy; I meant the agents inside the environment, which hold the state, take and play the action, and return the new state, reward, and done.

rusu24edward commented 4 years ago

Do you expect that all the agents interacting with your environment will have the same observation_space and action_space?

oroojlooy commented 4 years ago

Do you expect that all the agents interacting with your environment will have the same observation_space and action_space?

For this environment, all the observation_spaces are the same, but the action_spaces might be different. Let's assume the action_spaces are also equal. Is there any solution for this case?

rusu24edward commented 4 years ago

If the observation and action spaces for all agents interacting with your environment are the same, then you can store that info in a single location (like the environment) and grab it from there. For example, you could add static methods that return the spaces:

from gym.spaces import Box
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class CustomEnv(MultiAgentEnv):
    def __init__(self, env_config=None):
        ...

    @staticmethod
    def get_observation_space():
        # Shared by every agent; the bounds and shape are placeholders.
        return Box(low=0.0, high=10.0, shape=(5,))

    @staticmethod
    def get_action_space():
        # Shared by every agent; the bounds and shape are placeholders.
        return Box(low=-1.0, high=1.0, shape=(2,))

    def step(self, action_dict):
        ...

Then when using ray, you can just do something like:

import ray
from ray.rllib.agents.pg import PGTrainer  # PGAgent in older RLlib versions

from custom_env import CustomEnv  # hypothetical module holding CustomEnv

ray.init()
trainer = PGTrainer(env=CustomEnv, config={
    "multiagent": {
        "policies": {
            # (policy_cls, obs_space, act_space, config); None uses the default policy class
            "default": (None, CustomEnv.get_observation_space(),
                        CustomEnv.get_action_space(), {"gamma": 0.85}),
        },
        "policy_mapping_fn": lambda agent_id: "default",
    },
})

while True:
    print(trainer.train())

It doesn't have to look exactly like this; it's just a design idea to get you thinking.

So, you mean that there is no way of having what I want, right?

You have a bit of a design conflict here. Putting a single observation_space and a single action_space in the environment indicates, by design, that all agents interacting with the environment can expect to see the same observation and action spaces. You can still put the observation and action spaces in the environment even if they differ, but then you need some kind of mapping from agent "types" to the spaces. Here is an example:

import ray
from ray.rllib.env.multi_agent_env import MultiAgentEnv

from gym.spaces import Box

class CustomEnv(MultiAgentEnv):
    # Per-type spaces; the bounds and shapes here are arbitrary examples.
    action_mapping = {
        'type0': Box(-1, 1, shape=(12,)),
        'type1': Box(10, 20, shape=(8, 2)),
    }
    obs_mapping = {
        'type0': Box(-4, 5, shape=(12, 5)),
        'type1': Box(10, 20, shape=(8, 2)),
    }

    @staticmethod
    def get_observation_space(agent_type):
        return CustomEnv.obs_mapping[agent_type]

    @staticmethod
    def get_action_space(agent_type):
        return CustomEnv.action_mapping[agent_type]

    def step(self, action_dict):
        pass

# Test it
obs_space = CustomEnv.get_observation_space('type1')
print(obs_space)

Then when using ray, you can just do something like:

import ray
from ray.rllib.agents.pg import PGTrainer  # PGAgent in older RLlib versions

from custom_env import CustomEnv  # hypothetical module holding CustomEnv

ray.init()
trainer = PGTrainer(env=CustomEnv, config={
    "multiagent": {
        "policies": {
            "type0_policy": (None, CustomEnv.get_observation_space('type0'),
                             CustomEnv.get_action_space('type0'), {"gamma": 0.85}),
            "type1_policy": (None, CustomEnv.get_observation_space('type1'),
                             CustomEnv.get_action_space('type1'), {"gamma": 0.85}),
        },
        "policy_mapping_fn": lambda agent_id: 'type0_policy' if agent_id == 'type0_agent1' else 'type1_policy',
    },
})

while True:
    print(trainer.train())

There are a lot of ways you can do this; it really just depends on how you want to design it.

oroojlooy commented 4 years ago

Thanks for the detailed explanation. This approach works when I have environment agents with different action spaces.