Closed: philippGraf closed this issue 2 years ago.
Hey @philippGraf, thanks for raising this issue. Bandits don't support non-Discrete action spaces, but you can use a MultiDiscrete -> Discrete wrapper to achieve the same thing.
Could you try the following version of your repro script? I confirmed this one runs just fine.
```python
import ray
from ray import tune
from ray.rllib.agents.bandit import BanditLinUCBTrainer
import gym
from gym.spaces import MultiDiscrete, Box
import random
import numpy as np
from ray.rllib.env.wrappers.recsim import (
    MultiDiscreteToDiscreteActionWrapper,
)


class Test(gym.Env):
    def __init__(self, config=None):
        self.action_space = MultiDiscrete([2, 2])
        self.observation_space = Box(low=-1.0, high=1.0, shape=(2,))
        self.cur_context = None

    def reset(self):
        self.cur_context = random.choice([-1.0, 1.0])
        return np.array([self.cur_context, -self.cur_context])

    def step(self, action):
        # The wrapper decodes the agent's Discrete action back into the
        # multi-discrete sub-actions before it arrives here.
        action = (action[0] + 2 * action[1]) % 3
        rewards_for_context = {
            -1.0: [-10, 0, 10],
            1.0: [10, 0, -10],
        }
        reward = rewards_for_context[self.cur_context][action]
        return (
            np.array([-self.cur_context, self.cur_context]),
            reward,
            True,
            {"regret": 10 - reward},
        )


if __name__ == "__main__":
    config = {
        "env": "test_env",  # Pre-registered env identifier.
        # No remote workers by default.
        "num_workers": 0,
        "framework": "torch",  # Only PyTorch supported so far.
        # Do online learning one step at a time.
        "rollout_fragment_length": 1,
        "train_batch_size": 1,
        # Bandits can't afford to do one timestep per iteration, as that is
        # extremely slow because of the metrics-collection overhead. With this
        # setting, the agent is trained 100 times per RLlib iteration.
        "timesteps_per_iteration": 100,
    }

    tune.register_env(
        "test_env",
        lambda env_ctx: MultiDiscreteToDiscreteActionWrapper(Test(env_ctx)),
    )

    # Run the training task using tune.run.
    tune_result = tune.run(
        run_or_experiment=BanditLinUCBTrainer,
        config=config,
        local_dir="./logs",
    )
```
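For reference, the wrapper flattens the env's MultiDiscrete([2, 2]) space into a single Discrete(4) space for the agent and decodes each flat action index back into its sub-actions before passing it to the env. Conceptually it works like the minimal sketch below (illustrative only, not RLlib's actual implementation; see MultiDiscreteToDiscreteActionWrapper for the real code):

```python
import gym
import numpy as np
from gym.spaces import Discrete, MultiDiscrete


class FlattenMultiDiscrete(gym.ActionWrapper):
    """Expose a MultiDiscrete([n1, n2, ...]) env as Discrete(n1*n2*...)."""

    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.action_space, MultiDiscrete)
        self.nvec = env.action_space.nvec
        # One discrete action per combination of sub-actions.
        self.action_space = Discrete(int(np.prod(self.nvec)))

    def action(self, action):
        # Decode the flat index back into one sub-action per dimension.
        sub_actions = []
        for n in reversed(self.nvec):
            sub_actions.append(action % n)
            action //= n
        return np.array(list(reversed(sub_actions)))
```

With MultiDiscrete([2, 2]) the agent then simply picks one of four arms, which LinUCB handles fine.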
We should add an action space check for bandits, though, to make this less confusing for people in the future ...
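Such a check could be as simple as the sketch below (hypothetical; the function name and where it would hook into the trainer's config validation are assumptions, not actual RLlib code):

```python
from gym.spaces import Discrete, Space


def check_bandit_action_space(action_space: Space) -> None:
    """Fail fast with a readable message instead of a torch shape error."""
    if not isinstance(action_space, Discrete):
        raise ValueError(
            f"Bandit trainers only support Discrete action spaces, got "
            f"{action_space}. For MultiDiscrete spaces, wrap the env with "
            f"MultiDiscreteToDiscreteActionWrapper."
        )
```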
What happened + What you expected to happen
I want to train on custom contextual bandits with multi-discrete actions, but torch throws an error because of unexpected tensor shapes:
Failure # 1 (occurred at 2022-04-21_14-36-44)
Simple discrete action spaces seem to work. This seems to be an older problem: https://github.com/ray-project/ray/issues/14249
Nevertheless, I am puzzled, as the recommendation environment provided by RLlib itself does not work either and throws a different error:
Failure # 1 (occurred at 2022-04-21_14-44-26)
Versions / Dependencies
python==3.8.10 ray==1.12.0 torch==1.11.0+cu102
Reproduction script
Issue Severity
High: It blocks me from completing my task.