ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLLIB] Issue with AlphaZero algorithm Stateless CartPole #39937

Open AvisP opened 1 year ago

AvisP commented 1 year ago

What happened + What you expected to happen

I was trying to run the AlphaZero algorithm with the StatelessCartPole environment, but I get the following error:

2023-09-28 10:56:03,591 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-09-28 10:56:03,823 WARNING deprecation.py:50 -- DeprecationWarning: `rllib/algorithms/alpha_star/` has been deprecated. Use `rllib_contrib/alpha_star/` instead. This will raise an error in the future!
2023-09-28 10:56:03,823 WARNING algorithm_config.py:656 -- Cannot create AlphaZeroConfig from given `config_dict`! Property num_rollout_worker not supported.
2023-09-28 10:56:03,823 WARNING deprecation.py:50 -- DeprecationWarning: `algo = Algorithm(env='<class 'ray.rllib.examples.env.stateless_cartpole.StatelessCartPole'>', ...)` has been deprecated. Use `algo = AlgorithmConfig().environment('<class 'ray.rllib.examples.env.stateless_cartpole.StatelessCartPole'>').build()` instead. This will raise an error in the future!
/........../lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
/........../lib/python3.11/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
/........../lib/python3.11/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
/........../lib/python3.11/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-09-28 10:56:07,322 INFO worker.py:1621 -- Started a local Ray instance.
(pid=21130) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-09-28 10:56:11,952 ERROR actor_manager.py:500 -- Ray error, taking actor 1 out of service. The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=21130, ip=127.0.0.1, actor_id=ee2e4603890893fe9f0283a701000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x11193c6d0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in __init__
    self._update_policy_map(policy_dict=self.policy_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
    self._build_policy_map(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
    new_policy = create_policy_for_framework(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/utils/policy.py", line 142, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/alpha_zero/alpha_zero.py", line 327, in __init__
    super().__init__(
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/alpha_zero/alpha_zero_policy.py", line 38, in __init__
    self.env = self.env_creator()
               ^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/alpha_zero/alpha_zero.py", line 317, in _env_creator
    return env_cls(config["env_config"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/alpha_zero/ranked_rewards.py", line 43, in __init__
    self._initialize_buffer(r2_config["num_init_rewards"])
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/alpha_zero/ranked_rewards.py", line 51, in _initialize_buffer
    mask = obs["action_mask"]
           ~~~^^^^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
2023-09-28 10:56:11,953 ERROR actor_manager.py:500 -- Ray error, taking actor 2 out of service. The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=21131, ip=127.0.0.1, actor_id=4b3831cf40893720b5fa9a2601000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x107d33a90>)
[... identical traceback as for actor 1 above, ending in the same IndexError ...]
(RolloutWorker pid=21130) 2023-09-28 10:56:11,907       WARNING env.py:162 -- Your env doesn't have a .spec.max_episode_steps attribute. Your horizon will default to infinity, and your environment will not be reset.
(RolloutWorker pid=21130) 2023-09-28 10:56:11,909       WARNING deprecation.py:50 -- DeprecationWarning: `ray.rllib.models.torch.fcnet.FullyConnectedNetwork` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21130) 2023-09-28 10:56:11,909       WARNING deprecation.py:50 -- DeprecationWarning: `ray.rllib.models.torch.torch_modelv2.TorchModelV2` has been deprecated. Use `ray.rllib.core.rl_module.rl_module.RLModule` instead. This will raise an error in the future!
(RolloutWorker pid=21130) 2023-09-28 10:56:11,918       WARNING deprecation.py:50 -- DeprecationWarning: `TorchPolicy` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21130) 2023-09-28 10:56:11,919       WARNING deprecation.py:50 -- DeprecationWarning: `StochasticSampling` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21130) 2023-09-28 10:56:11,919       WARNING deprecation.py:50 -- DeprecationWarning: `Exploration` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21130) 2023-09-28 10:56:11,919       WARNING deprecation.py:50 -- DeprecationWarning: `Random` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21130) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=21130, ip=127.0.0.1, actor_id=ee2e4603890893fe9f0283a701000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x11193c6d0>)
(RolloutWorker pid=21130) [... same traceback as above ...]
Traceback (most recent call last):
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/worker_set.py", line 157, in __init__
    self._setup(
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/worker_set.py", line 227, in _setup
    self.add_workers(
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/worker_set.py", line 593, in add_workers
    raise result.get()
  File "/........../lib/python3.11/site-packages/ray/rllib/utils/actor_manager.py", line 481, in __fetch_result
    result = ray.get(r)
             ^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/_private/worker.py", line 2526, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=21130, ip=127.0.0.1, actor_id=ee2e4603890893fe9f0283a701000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x11193c6d0>)
[... same embedded traceback as above ...]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "/Users/paula/Desktop/Projects/RL Practice/RLLIB_Practice4/stateless_cartpole_alphazero.py", line 21, in <module>
    a = AlphaZero(**nn_kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/utils/deprecation.py", line 106, in patched_init
    return obj_init(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 517, in __init__
    super().__init__(
  File "/........../lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/........../lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 639, in setup
    self.workers = WorkerSet(
                   ^^^^^^^^^^
  File "/........../lib/python3.11/site-packages/ray/rllib/evaluation/worker_set.py", line 179, in __init__
    raise e.args[0].args[2]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
(pid=21131) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=21131) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=21131, ip=127.0.0.1, actor_id=4b3831cf40893720b5fa9a2601000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x107d33a90>)
(RolloutWorker pid=21131) [... same traceback as above; Ray deduplicates repeated log lines by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options ...]
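
From the traceback, the crash happens while AlphaZero initializes its ranked-rewards buffer: ranked_rewards.py indexes the observation with a string key (obs["action_mask"]), which only works for dict observations. StatelessCartPole returns a flat Box observation (a plain numpy array), so the string index raises the IndexError above. A minimal illustration (my own snippet, not RLlib code):

import numpy as np

# StatelessCartPole-style observation: a flat Box (numpy array),
# not a dict with "obs"/"action_mask" keys.
obs = np.array([0.03, -0.01], dtype=np.float32)

mask = obs["action_mask"]
# IndexError: only integers, slices (`:`), ellipsis (`...`),
# numpy.newaxis (`None`) and integer or boolean arrays are valid indices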

Versions / Dependencies

ray: 2.6.3
OS: macOS

Reproduction script

The following is the script that I used:

from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole
from ray.rllib.algorithms.alpha_zero import AlphaZero

nn_config = {
    # Config to pass to the env class
    # "env_config": env_config,
    # Neural network config
    "lr": 0.003,
    # "model": model_dict,
    "gamma": 0.95,
    "train_batch_size": 20_000,
    # Note: RLlib warns at startup that this key is not supported
    # (the actual config property is `num_rollout_workers`):
    "num_rollout_worker": 1,
    "training": {"_enable_learner_api": False},
    "rl_module": {"_enable_rl_module_api": False},
}

nn_kwargs = {
    "env": StatelessCartPole,
    "config": nn_config,
}

a = AlphaZero(**nn_kwargs)

print(a)
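
For anyone hitting the same error: a possible workaround is to wrap StatelessCartPole so it returns the dict observation that AlphaZero's ranked-rewards warm-up indexes into. This is an untested sketch, modeled on RLlib's AlphaZero CartPole example, which uses a dict observation with an "action_mask" key; the class name and helper below are my own:

import numpy as np
import gymnasium as gym

from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole


class MaskedStatelessCartPole(StatelessCartPole):
    """Hypothetical wrapper (untested): exposes the dict observation
    ({"obs": ..., "action_mask": ...}) that AlphaZero's ranked-rewards
    warm-up expects. In CartPole every action is always valid, so the
    mask is all ones."""

    def __init__(self, config=None):
        super().__init__(config)
        self.observation_space = gym.spaces.Dict({
            "obs": self.observation_space,
            "action_mask": gym.spaces.Box(
                0.0, 1.0, (self.action_space.n,), dtype=np.float32
            ),
        })

    def _mask(self, obs):
        # Wrap the flat Box observation in the expected dict format.
        return {
            "obs": obs,
            "action_mask": np.ones(self.action_space.n, dtype=np.float32),
        }

    def reset(self, *, seed=None, options=None):
        obs, info = super().reset(seed=seed, options=options)
        return self._mask(obs), info

    def step(self, action):
        # gymnasium 5-tuple API, as used by Ray 2.6.
        obs, reward, terminated, truncated, info = super().step(action)
        return self._mask(obs), reward, terminated, truncated, info

Note that AlphaZero's MCTS also calls get_state/set_state on the env (see RLlib's CartPoleSparseRewards example env), so a full workaround would likely need those methods too; this sketch only addresses the action_mask crash shown in the traceback.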

Issue Severity

High: It blocks me from completing my task.

sven1977 commented 1 year ago

Hey @AvisP, thanks for raising this issue. We are very sorry, but AlphaZero will be moved in Ray 2.8 into the rllib_contrib repo (outside of RLlib) and will no longer receive support from the team. See here for more information on our contrib efforts: https://github.com/ray-project/ray/tree/master/rllib_contrib