ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLLib] 'EpisodeV2' missing key functions and attributes related to env #37319

Open · vymao opened this issue 1 year ago

vymao commented 1 year ago

What happened + What you expected to happen

The Episode class provided the method last_info_for to pull the info dict returned to the agent at the latest step. EpisodeV2 no longer has such a method, so callbacks now fail with errors like 'EpisodeV2' object has no attribute 'last_info_for'. This also seems to break the rllib/examples/custom_metrics_and_callbacks.py example shipped with Ray (a different error, but the same underlying cause).

Is there a recommended workaround for this? Can we still opt to use Episode, or is there a version of Ray we would need to downgrade to in order to get this back?
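
For now the only mitigation I have is to guard the call so the callback degrades gracefully when it receives an EpisodeV2 (just a sketch; the helper name and the empty-dict fallback are my own, not an RLlib API):

def _last_info(episode):
    # The v1 Episode still exposes `last_info_for`; EpisodeV2 does not,
    # so fall back to an empty dict instead of raising AttributeError.
    if hasattr(episode, "last_info_for"):
        return episode.last_info_for() or {}
    return {}

This avoids the crash, but it silently drops the metric whenever the info dict is not reachable.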

Versions / Dependencies

Ray 2.5.0

Reproduction script

Using this as a callback in training:

import numpy as np

from ray.rllib.algorithms import td3 as ray_td3
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class RewardLoggerCallback(DefaultCallbacks):
    def on_episode_start(
        self, *, worker, base_env, policies, episode, env_index, **kwargs
    ):
        # Collect per-step values so they can be aggregated at episode end.
        episode.user_data = {
            'MainRew': []
        }

    def on_episode_step(
        self, *, worker, base_env, episode, env_index, **kwargs
    ):
        # Running metrics -> keep all values
        # Final metrics -> only keep the current value
        info = episode.last_info_for()  # raises AttributeError on EpisodeV2
        for k in episode.user_data.keys():
            episode.user_data[k].append(info[k])

    def on_episode_end(
        self, *, worker, base_env, policies, episode, env_index, **kwargs
    ):
        for name, value in episode.user_data.items():
            episode.custom_metrics[name + "_avg"] = np.mean(value)
            episode.custom_metrics[name + "_sum"] = np.sum(value)
            episode.hist_data[name] = value


algo = ray_td3.TD3Config().environment(env="HumanoidV4").training(
    actor_hiddens=[256],
    critic_hiddens=[562]
).callbacks(
    callbacks_class=RewardLoggerCallback
).build()

Issue Severity

High: It blocks me from completing my task.

antoine-galataud commented 1 year ago

In addition to last_info_for, other useful methods seem to have disappeared in EpisodeV2, such as last_observation_for and last_raw_obs_for.

If you modify rllib/examples/custom_metrics_and_callbacks.py to train with PPO, you end up with exceptions about these missing methods. Sample code is provided below:

"""Example of using RLlib's debug callbacks.

Here we use callbacks to track the average CartPole pole angle magnitude as a
custom metric.

We then use `keep_per_episode_custom_metrics` to keep the per-episode values
of our custom metrics and do our own summarization of them.
"""

import argparse
import os
from typing import Dict

import gymnasium as gym
import numpy as np
import ray
from ray import air, tune
from ray.rllib.algorithms.callbacks import DefaultCallbacks
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env import BaseEnv
from ray.rllib.evaluation import Episode, RolloutWorker
from ray.rllib.policy import Policy

parser = argparse.ArgumentParser()
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "torch"],
    default="torch",
    help="The DL framework specifier.",
)
parser.add_argument("--stop-iters", type=int, default=2000)

# Create a custom CartPole environment that maintains an estimate of velocity
class CustomCartPole(gym.Env):
    def __init__(self, config):
        self.env = gym.make("CartPole-v1")
        self.observation_space = self.env.observation_space
        self.action_space = self.env.action_space
        self._pole_angle_vel = 0.0
        self.last_angle = 0.0

    def reset(self, *, seed=None, options=None):
        self._pole_angle_vel = 0.0
        # Forward the seed/options to the wrapped env.
        obs, info = self.env.reset(seed=seed, options=options)
        self.last_angle = obs[2]
        return obs, info

    def step(self, action):
        obs, rew, term, trunc, info = self.env.step(action)
        angle = obs[2]
        self._pole_angle_vel = (
            0.5 * (angle - self.last_angle) + 0.5 * self._pole_angle_vel
        )
        info["pole_angle_vel"] = self._pole_angle_vel
        return obs, rew, term, trunc, info

class MyCallbacks(DefaultCallbacks):
    def on_episode_start(
        self,
        *,
        worker: RolloutWorker,
        base_env: BaseEnv,
        policies: Dict[str, Policy],
        episode: Episode,
        env_index: int,
        **kwargs
    ):
        # Initialize the list that `on_episode_step` appends to below.
        episode.user_data["pole_angles"] = []

    def on_episode_step(
        self,
        *,
        worker: RolloutWorker,
        base_env: BaseEnv,
        policies: Dict[str, Policy],
        episode: Episode,
        env_index: int,
        **kwargs
    ):
        # Make sure this episode is ongoing.
        assert episode.length > 0, (
            "ERROR: `on_episode_step()` callback should not be called right "
            "after env reset!"
        )
        pole_angle = abs(episode.last_observation_for()[2])
        raw_angle = abs(episode.last_raw_obs_for()[2])
        assert pole_angle == raw_angle
        episode.user_data["pole_angles"].append(pole_angle)

        # Sometimes our pole is moving fast. We can look at the latest velocity
        # estimate from our environment and log high velocities.
        if np.abs(episode.last_info_for()["pole_angle_vel"]) > 0.25:
            print("This is a fast pole!")

if __name__ == "__main__":
    args = parser.parse_args()

    config = (
        PPOConfig()
        .environment(CustomCartPole)
        .framework(args.framework)
        .callbacks(MyCallbacks)
        .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
        .rollouts(enable_connectors=True)
        .reporting(keep_per_episode_custom_metrics=True)
    )

    ray.init(local_mode=True)
    tuner = tune.Tuner(
        "PPO",
        run_config=air.RunConfig(
            stop={
                "training_iteration": args.stop_iters,
            },
        ),
        param_space=config,
    )
    # there is only one trial involved.
    result = tuner.fit().get_best_result()

    # Verify episode-related custom metrics are there.
    custom_metrics = result.metrics["custom_metrics"]
    print(custom_metrics)
    assert "pole_angle_mean" in custom_metrics
    assert "pole_angle_var" in custom_metrics

Is there a workaround for this?

antoine-galataud commented 1 year ago

So far I see 2 possible solutions:

@ArturNiederfahrenhorst let me know your thoughts, I'd be happy to work on a PR.

antoine-galataud commented 1 year ago

Here is a workaround for anyone who needs it: when using PPO, you can force the use of Episode (v1) by disabling the new RL Module API, the Learner API, and connectors. Sample config:

config = (
    PPOConfig()
    .rl_module(_enable_rl_module_api=False)
    .training(_enable_learner_api=False)
    .rollouts(enable_connectors=False)
    .environment(CustomCartPole)
    .framework(args.framework)
    .callbacks(MyCallbacks)
    .resources(num_gpus=int(os.environ.get("RLLIB_NUM_GPUS", "0")))
    .reporting(keep_per_episode_custom_metrics=True)
)
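
With these flags disabled, the sampler builds the v1 Episode again, so last_info_for / last_observation_for are available in the callbacks above. A quick sanity check (a sketch, assuming the config above; not part of the original example):

algo = config.build()
algo.train()  # MyCallbacks.on_episode_step should now run without AttributeError
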
ArturNiederfahrenhorst commented 1 year ago

@antoine-galataud We are moving away from EnvRunnerV2, so such efforts should go into https://sourcegraph.com/github.com/ray-project/ray/-/blob/rllib/env/env_runner.py. Thanks for offering your help - could you hold off for 1-2 weeks? After https://github.com/ray-project/ray/pull/39732 is merged, there should be a clearer picture on master of how such Episodes are built in PPO.

Thereafter, there will likely be an EpisodeV3, where these changes should go.

CC @sven1977 @simonsays1980

CarlDegio commented 11 months ago

Hello! Is there currently a way to read this information? I can't access the action and observation in the training loop. @ArturNiederfahrenhorst