ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Discrete action space start value not followed in multi-agent #42196

Open man2machine opened 10 months ago

man2machine commented 10 months ago

What happened + What you expected to happen

When a discrete action space is used with a non-zero start value, the actions generated by the RLlib policy do not respect it, and as a result the actions passed to the environment fall outside the space. I was able to reproduce the error consistently with the script below. I found this while using RLlib multi-agent; the problem may exist for single-agent as well (I did not test that).
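
For reference, here is a quick illustrative check (separate from the reproduction script below) of what the space itself accepts under gymnasium's Discrete semantics:

import gymnasium.spaces as spaces

# Discrete(10, start=1) covers the integers 1..10 (inclusive).
space = spaces.Discrete(10, start=1)
print(space.contains(1))    # True
print(space.contains(10))   # True
print(space.contains(0))    # False -- yet a 0-based index is what the policy appears to emit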

Versions / Dependencies

OS: Linux
Python: 3.11
Ray: 2.6.3

Reproduction script

from typing import Any, Self, final
from collections.abc import Mapping

import gymnasium.spaces as spaces

from ray.tune.registry import register_env  # type: ignore
from ray.rllib.env.multi_agent_env import MultiAgentEnv as RayMultiAgentEnv
from ray.rllib.evaluation.episode import Episode
from ray.rllib.policy.policy import PolicySpec
from ray.rllib.algorithms.appo import APPOConfig

State = Any
Action = Any
AgentId = str

@final
class TestEnv(RayMultiAgentEnv):
    observation_space: Mapping[AgentId, spaces.Space[State]]  # type: ignore
    action_space: Mapping[AgentId, spaces.Space[Action]]  # type: ignore

    _agent_ids: list[AgentId] | set[AgentId]
    _action_space_in_preferred_format: bool | None
    _obs_space_in_preferred_format: bool | None

    def __init__(
        self: Self
    ) -> None:

        # following example from
        # https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_different_spaces_for_agents.py

        self.observation_space = spaces.Dict({
            'main': spaces.Box(0, 1)
        })
        self.action_space = spaces.Dict({
            'main': spaces.Discrete(10, start=1)
        })

        self._agent_ids = ['main']
        self._action_space_in_preferred_format = True
        self._obs_space_in_preferred_format = True

        super().__init__()

    def reset(  # type: ignore
        self: Self,
        *,
        seed: int | None = None,
        options: dict[str, Any] | None = None
    ) -> tuple[dict[AgentId, State], dict[AgentId, dict[AgentId, Any]]]:  # type: ignore

        self.n = 20

        observations: dict[AgentId, State] = {}
        state = self.observe('main')
        observations['main'] = state

        return observations, {}

    def observe(
        self: Self,
        agent: AgentId
    ) -> Any | None:

        return self.observation_space[agent].sample()

    def step(  # type: ignore
        self: Self,
        action_dict: dict[AgentId, Action]
    ) -> tuple[
        dict[AgentId, State],
        dict[AgentId, float],
        dict[AgentId, bool],
        dict[AgentId, bool],
        dict[AgentId, dict[AgentId, Any]]
    ]:

        observations: dict[AgentId, State] = {}
        rewards: dict[AgentId, float] = {}
        terminations: dict[AgentId, bool] = {}
        truncations: dict[AgentId, bool] = {}
        infos: dict[AgentId, dict[AgentId, Any]] = {}

        for agent_id in action_dict:
            assert self.action_space[agent_id].contains(action_dict[agent_id]), (
                "Wanted {} space, got {} instead".format(self.action_space[agent_id], action_dict[agent_id])
            )

        self.n -= 1

        observations['main'] = self.observe('main')
        rewards['main'] = abs(action_dict['main'] - 5)
        terminations['__all__'] = (self.n == 0)
        truncations['__all__'] = False

        return observations, rewards, terminations, truncations, infos

def make_ray_env(
    env_config: dict[str, Any]
) -> RayMultiAgentEnv:

    return TestEnv()

def policy_mapping_fn(
    agent_id: AgentId,
    episode: Episode,
    worker: Any,
    **kwargs: Any
) -> str:

    return 'main'

register_env('test_env', make_ray_env)

config = APPOConfig()
config.training(lr=0.01, grad_clip=30.0)  # type: ignore
config.multi_agent(  # type: ignore
    policies={
        'main': PolicySpec()
    },
    policy_mapping_fn=policy_mapping_fn  # type: ignore
)
config.environment(env='test_env')  # type: ignore

algo = config.build()

algo.train()

Issue Severity

High: It blocks me from completing my task.

sven1977 commented 8 months ago

Hey @man2machine, thanks for raising this issue. This is really interesting. I actually did not know you could set a start arg in a Discrete space :)

As a simple workaround, could you apply the extra +1 shift inside your env's step() code?

Like so:


from collections import OrderedDict

def step(self, action_dict):
    # Shift the 0-based actions coming from the policy into the start=1 space.
    action_dict = OrderedDict({k: a + 1 for k, a in action_dict.items()})
    ...  # continue with this shifted dict

I'm trying to PR a better solution in the meantime. I tried wrapping your env in a gym.ActionWrapper, but RLlib's env checker and the multi-agent env do not allow this, because a gym.ActionWrapper is NOT an RLlib BaseEnv or an RLlib MultiAgentEnv, so more issues will surface.
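
In case it helps in the meantime, here is a rough, untested sketch of the subclass route (the ShiftedActionEnv name is made up, and it assumes every agent's action space is a Discrete with a non-zero start): a thin MultiAgentEnv that re-bases the incoming 0-based actions onto each space's start and delegates everything else to the wrapped env:

from typing import Any

from ray.rllib.env.multi_agent_env import MultiAgentEnv

class ShiftedActionEnv(MultiAgentEnv):
    """Sketch only: re-base 0-based policy actions onto each Discrete
    space's start before delegating to the wrapped env."""

    def __init__(self, env: MultiAgentEnv) -> None:
        self.env = env
        self.observation_space = env.observation_space
        self.action_space = env.action_space
        self._agent_ids = env._agent_ids
        self._action_space_in_preferred_format = True
        self._obs_space_in_preferred_format = True
        super().__init__()

    def reset(self, *, seed: int | None = None, options: dict[str, Any] | None = None):
        return self.env.reset(seed=seed, options=options)

    def step(self, action_dict: dict[str, Any]):
        shifted = {
            agent_id: action + self.action_space[agent_id].start
            for agent_id, action in action_dict.items()
        }
        return self.env.step(shifted)

You would then register it with something like register_env('test_env', lambda cfg: ShiftedActionEnv(TestEnv())).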

man2machine commented 3 months ago

@sven1977 Apologies for getting back to you so late. I ended up doing something similar to what you suggested and handled the start offset within the environment's step() code. Either way, this is something RLlib should support, since start is part of the gym spaces API and other people may run into the same issue.