ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Wrong observation type when creating videos during evaluation of AlphaZero algorithm #20564

Closed. cirrostratus1 closed this issue 2 years ago.

cirrostratus1 commented 2 years ago

Search before asking

Ray Component

RLlib

What happened + What you expected to happen

Run the AlphaZero algorithm with video generation enabled during evaluation. The process crashes because the observation is not contained in the observation space:

ValueError: ('Observation ({} dtype={}) outside given space ({})!', array([ 0.02788446, -0.02576767, -0.0316641 ,  0.02233604], dtype=float32), None, Dict(action_mask:Box([0. 0.], [1. 1.], (2,), float32), obs:Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)))

The problem is that the observation contains only the value of the "obs" key, but it should be a dict containing both the observation and the action mask. The error occurs only during evaluation, which runs on the local worker.
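
For illustration, a minimal standalone check (using only gym and numpy, independent of RLlib) reproduces the mismatch: the bare array from the traceback fails the Dict space check, while the full dict passes.

from gym.spaces import Box, Dict
import numpy as np

# A Dict space with the same structure as in the reproduction script below.
space = Dict({
    "obs": Box(low=-4.8, high=4.8, shape=(4,), dtype=np.float32),
    "action_mask": Box(low=0, high=1, shape=(2,), dtype=np.float32),
})
inner = np.zeros(4, dtype=np.float32)

print(space.contains(inner))  # False: bare array, as in the crash above
print(space.contains({"obs": inner,
                      "action_mask": np.ones(2, dtype=np.float32)}))  # True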

Expected behavior: The script should run without errors and produce videos in the output folder.

The environment used is a modified version of the original CartPole environment for AlphaZero, with code for video creation added and with bug #19861 (regarding the dtype of the observation) fixed.

Versions / Dependencies

Ray 1.8.0, Python 3.7, Ubuntu 18.04 LTS, ffmpeg

Reproduction script

"""Example of using training on CartPole."""
from copy import deepcopy

import gym
import numpy as np
import ray
from gym.spaces import Discrete, Dict, Box
from ray import tune
from ray.rllib.contrib.alpha_zero.models.custom_torch_models import DenseModel
from ray.rllib.models.catalog import ModelCatalog

class VideoCartPole:
    """
    Wrapper for gym CartPole environment where the reward
    is accumulated to the end
    """

    def __init__(self, config=None):
        self.env = gym.make("CartPole-v0")
        self.action_space = Discrete(2)
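        # The contrib AlphaZero model consumes a Dict observation with "obs"
        # and "action_mask" keys; the mask marks which actions are valid.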
        self.observation_space = Dict({"obs": self.env.observation_space,
            "action_mask": Box(low=0, high=1, shape=(self.action_space.n,), dtype=np.float32)})
        self.running_reward = 0

    @property
    def metadata(self):
        return self.env.metadata

    @property
    def spec(self):
        return self.env.spec

    def close(self):
        self.env.close()

    def reset(self):
        self.running_reward = 0
        return {"obs": self.env.reset().astype(np.float32), "action_mask": np.array([1, 1], dtype=np.float32)}

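    # A real env would return an actual frame here; a random dummy frame is
    # enough to exercise RLlib's video creation.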
    def render(self, mode="rgb_array"):
        return (np.random.random((640, 480, 3)) * 255).astype(np.uint8)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self.running_reward += rew
        score = self.running_reward if done else 0
        return {"obs": obs.astype(np.float32), "action_mask": np.array([1, 1], dtype=np.float32)}, score, done, info

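    # set_state()/get_state() are required by the contrib AlphaZero MCTS,
    # which snapshots and restores the env while expanding the search tree.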
    def set_state(self, state):
        self.running_reward = state[1]
        self.env = deepcopy(state[0])
        obs = np.array(list(self.env.unwrapped.state))
        return {"obs": obs.astype(np.float32), "action_mask": np.array([1, 1], dtype=np.float32)}

    def get_state(self):
        return deepcopy(self.env), self.running_reward

if __name__ == "__main__":
    ray.init(num_cpus=4)

    ModelCatalog.register_custom_model("dense_model", DenseModel)

    tune.run(
        "contrib/AlphaZero",
        stop={"training_iteration": 10000},
        max_failures=0,
        config={
            "env": VideoCartPole,
            "num_workers": 3,
            "rollout_fragment_length": 50,
            "train_batch_size": 50,
            "sgd_minibatch_size": 32,
            "lr": 1e-4,
            "num_sgd_iter": 1,
            "mcts_config": {
                "puct_coefficient": 1.5,
                "num_simulations": 100,
                "temperature": 1.0,
                "dirichlet_epsilon": 0.20,
                "dirichlet_noise": 0.03,
                "argmax_tree_policy": False,
                "add_dirichlet_noise": True,
            },
            "ranked_rewards": {
                "enable": True,
            },
            "model": {
                "custom_model": "dense_model",
            },
            "evaluation_interval": 1,
            "evaluation_config": {
                "render_env": True,
                "mcts_config": {
                    "argmax_tree_policy": True,
                    "add_dirichlet_noise": False,
                },
            },
        },
    )
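
As a standalone sanity check (a sketch, meant to be run separately from the tune.run call above), the wrapper itself always returns observations inside its declared space, which points at RLlib's local evaluation worker rather than the env:

env = VideoCartPole()
assert env.observation_space.contains(env.reset())
obs, rew, done, info = env.step(env.action_space.sample())
assert env.observation_space.contains(obs)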

Anything else

Occurs every time.

Are you willing to submit a PR?

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!