Closed: philippGraf closed this issue 2 years ago.
Hey @philippGraf, thanks for raising this issue. Bandits don't support non-Discrete action spaces, but you can use a MultiDiscrete -> Discrete wrapper to achieve the same thing.
Could you try the following version of your repro script? I confirmed this one runs just fine.
```python
import ray
from ray import tune
from ray.rllib.agents.bandit import BanditLinUCBTrainer
import gym
from gym.spaces import MultiDiscrete, Box
import random
import numpy as np
from ray.rllib.env.wrappers.recsim import (
    MultiDiscreteToDiscreteActionWrapper,
)


class Test(gym.Env):
    def __init__(self, config=None):
        self.action_space = MultiDiscrete([2, 2])
        self.observation_space = Box(low=-1.0, high=1.0, shape=(2,))
        self.cur_context = None

    def reset(self):
        self.cur_context = random.choice([-1.0, 1.0])
        return np.array([self.cur_context, -self.cur_context])

    def step(self, action):
        # The wrapper decodes the agent's Discrete action back into the
        # multi-discrete sub-actions before it arrives here.
        action = (action[0] + 2 * action[1]) % 3
        rewards_for_context = {
            -1.0: [-10, 0, 10],
            1.0: [10, 0, -10],
        }
        reward = rewards_for_context[self.cur_context][action]
        return (
            np.array([-self.cur_context, self.cur_context]),
            reward,
            True,
            {"regret": 10 - reward},
        )


if __name__ == "__main__":
    config = {
        "env": "test_env",  # Pre-registered env identifier.
        # No remote workers by default.
        "num_workers": 0,
        "framework": "torch",  # Only PyTorch supported so far.
        # Do online learning one step at a time.
        "rollout_fragment_length": 1,
        "train_batch_size": 1,
        # Bandits can't afford to do one timestep per iteration, as that is
        # extremely slow because of the metrics-collection overhead. With this
        # setting, the agent is trained 100 times per RLlib iteration.
        "timesteps_per_iteration": 100,
    }

    tune.register_env(
        "test_env",
        lambda env_ctx: MultiDiscreteToDiscreteActionWrapper(Test(env_ctx)),
    )

    # Run the training task using tune.run.
    tune_result = tune.run(
        run_or_experiment=BanditLinUCBTrainer,
        config=config,
        local_dir="./logs",
    )
```
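For reference, the wrapper flattens the env's MultiDiscrete([2, 2]) space into a single Discrete(4) space for the agent and decodes each flat action index back into its sub-actions before passing it to the env. Conceptually it works like the minimal sketch below (illustrative only, not RLlib's actual implementation; see MultiDiscreteToDiscreteActionWrapper for the real code):

```python
import gym
import numpy as np
from gym.spaces import Discrete, MultiDiscrete


class FlattenMultiDiscrete(gym.ActionWrapper):
    """Expose a MultiDiscrete([n1, n2, ...]) env as Discrete(n1*n2*...)."""

    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.action_space, MultiDiscrete)
        self.nvec = env.action_space.nvec
        # One discrete action per combination of sub-actions.
        self.action_space = Discrete(int(np.prod(self.nvec)))

    def action(self, action):
        # Decode the flat index back into one sub-action per dimension.
        sub_actions = []
        for n in reversed(self.nvec):
            sub_actions.append(action % n)
            action //= n
        return np.array(list(reversed(sub_actions)))
```

With MultiDiscrete([2, 2]) the agent then simply picks one of four arms, which LinUCB handles fine.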
We should add an action space check for bandits, though, to make this less confusing for people in the future ...
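Such a check could be as simple as the sketch below (hypothetical; the function name and where it would hook into the trainer's config validation are assumptions, not actual RLlib code):

```python
from gym.spaces import Discrete, Space


def check_bandit_action_space(action_space: Space) -> None:
    """Fail fast with a readable message instead of a torch shape error."""
    if not isinstance(action_space, Discrete):
        raise ValueError(
            f"Bandit trainers only support Discrete action spaces, got "
            f"{action_space}. For MultiDiscrete spaces, wrap the env with "
            f"MultiDiscreteToDiscreteActionWrapper."
        )
```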
What happened + What you expected to happen
I want to train on custom contextual bandits with multi-discrete actions, but torch throws an error because of unexpected tensor shapes:
Failure # 1 (occurred at 2022-04-21_14-36-44)
Simple discrete action spaces seem to work. This seems to be an older problem: https://github.com/ray-project/ray/issues/14249
Nevertheless, I am puzzled, as the recommendation environment provided by RLlib itself does not work either and throws a different error:
Failure # 1 (occurred at 2022-04-21_14-44-26)
Versions / Dependencies
python==3.8.10 ray==1.12.0 torch==1.11.0+cu102
Reproduction script
Issue Severity
High: It blocks me from completing my task.