ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[RLlib] PPO policy evaluation using CLI and Python API produce different reward distributions #37974

Open · berryjk opened this issue 1 year ago

berryjk commented 1 year ago

What happened + What you expected to happen

I am getting different reward distributions when I evaluate a PPO checkpoint with the Python API versus the rllib CLI. Below is a comparison of the reward histograms produced by each method. This issue has been reproduced on Ray 2.4, 2.5, and 2.6, and with other environments. It does not occur with DQN.

Bin Edges:        [ 0. 25. 50. 75. 100. 125. 150. 175. 200. 225. 250. 275. 300. 325. 350. 375. 400. 425. 450. 475. 500. 525.]
CLI Histogram:    [ 0  0  0  0  0  0  0  0  1  0  3  6  1  4  3  2  2  2  2  3 71]
Python Histogram: [ 0  0  0  0  0 42 41 10  5  1  0  0  0  0  1  0  0  0  0  0  0]

To train I'm using:

$ rllib train file cartpole-ppo.yaml

Where cartpole-ppo.yaml is the one provided in the tuned_examples directory:

cartpole-ppo-troubleshoot:
    env: CartPole-v1
    run: PPO
    stop:
        sampler_results/episode_reward_mean: 150
        timesteps_total: 100000
    config:
        # Works for both torch and tf.
        framework: torch
        gamma: 0.99
        lr: 0.0003
        num_workers: 1
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            vf_share_layers: true
        enable_connectors: true
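
For reference, a roughly equivalent configuration through the Python API (an untested sketch, assuming Ray 2.6's AlgorithmConfig builder methods) would be:

from ray.rllib.algorithms.ppo import PPOConfig

# Mirrors the settings in cartpole-ppo.yaml above.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(
        num_rollout_workers=1,
        observation_filter="MeanStdFilter",
        enable_connectors=True,
    )
    .training(
        gamma=0.99,
        lr=0.0003,
        num_sgd_iter=6,
        vf_loss_coeff=0.01,
        model={
            "fcnet_hiddens": [32],
            "fcnet_activation": "linear",
            "vf_share_layers": True,
        },
    )
)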

The checkpoint is evaluated using the CLI and episode rewards are printed to stdout:

$ rllib evaluate --algo PPO --episodes 100 --steps 0 [checkpoint_path]

The same checkpoint is also evaluated with the Python API as follows. (Note: toggling exploration on/off does not resolve the discrepancy, and neither does calling the compute_single_action method directly on the Policy object.)

import gymnasium as gym
from ray.rllib.algorithms import Algorithm

def evaluate_model_python_api(checkpoint_path, n_episodes=100):
    env_name = "CartPole-v1"
    env = gym.make(env_name)
    algo = Algorithm.from_checkpoint(checkpoint_path)

    episode_rewards = []
    for _ in range(n_episodes):
        episode_reward = 0
        terminated = truncated = False
        obs, info = env.reset()

        while not terminated and not truncated:
            action = algo.compute_single_action(obs, explore=False)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward  

        episode_rewards.append(episode_reward)
    return episode_rewards

Versions / Dependencies

ray[rllib]==2.6

Reproduction script

Here is the full script that trains, evaluates, scrapes data from stdout, and creates the histograms. Repeating this process with DQN shows matching histograms for each evaluation method.

import gymnasium as gym
from ray.rllib.algorithms import Algorithm
import os
import subprocess
from glob import glob
import numpy as np

def train(yaml_path):
    cmd_str = "rllib train file {} ".format(yaml_path)
    print(cmd_str)
    os.system(cmd_str)

def latest_checkpoint(root_path="~/ray_results"):
    results_dir = os.path.expanduser(root_path)   
    checkpoints = glob(os.path.join(results_dir, "**","checkpoint*"), recursive=True)
    latest = max(checkpoints, key=os.path.getmtime)
    return latest

def evaluate_model_cli(checkpoint_path, n_episodes=100):
    cmd = ["rllib", "evaluate", "--algo", "PPO", "--episodes", str(n_episodes), "--steps", "0", checkpoint_path]
    print(" ".join(cmd))
    stdout_str = subprocess.run(cmd, stdout=subprocess.PIPE).stdout.decode('utf-8')
    stdout_rows = stdout_str.split("\n")
    episode_rewards = [float(row.split(":")[-1]) for row in stdout_rows if row != ""]
    return episode_rewards

def evaluate_model_python_api(checkpoint_path, n_episodes=100):
    env_name = "CartPole-v1"
    env = gym.make(env_name)
    algo = Algorithm.from_checkpoint(checkpoint_path)

    episode_rewards = []
    for _ in range(n_episodes):
        episode_reward = 0
        terminated = truncated = False
        obs, info = env.reset()

        while not terminated and not truncated:
            action = algo.compute_single_action(obs, explore=False)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward

        episode_rewards.append(episode_reward)
    return episode_rewards

if __name__ == '__main__':
    yaml_dir = os.path.realpath(os.path.dirname(__file__))
    yaml_path = os.path.join(yaml_dir, "cartpole_ppo_example.yaml")
    train(yaml_path)
    checkpoint_path = latest_checkpoint()
    cli_rewards = evaluate_model_cli(checkpoint_path)
    python_rewards = evaluate_model_python_api(checkpoint_path)

    hist_range = [0, 525]
    bins = 21

    cli_hist, bin_edges = np.histogram(cli_rewards, bins=bins, range=hist_range)
    python_hist, bin_edges = np.histogram(python_rewards, bins=bins, range=hist_range)

    print("\n\n")
    print("Bin Edges: {}".format(bin_edges))
    print("CLI Histogram:    {}".format(cli_hist))
    print("Python Histogram: {}".format(python_hist))

Issue Severity

High: It blocks me from completing my task.

berryjk commented 1 year ago

Replicating this comparison for several of the other tuned CartPole examples shows that the reward histograms match well for SimpleQ and DQN, but not for A3C, A2C, PPO, SAC, APPO, or DDPPO.

[Attached image: per-algorithm reward histograms comparing CLI and Python API evaluation]

ArturNiederfahrenhorst commented 1 year ago

@berryjk Thanks for putting in the time to create this issue! Just to make this clear before I start debugging: the same checkpoint gives you different results when you evaluate with the CLI vs. with the Python script for A3C, A2C, PPO, SAC, APPO, and DDPPO? (You are not comparing "training with CLI and evaluation with CLI" against "training with script and evaluation with script", right?)

berryjk commented 1 year ago

@ArturNiederfahrenhorst That is correct. The discrepancy was observed by evaluating a single checkpoint using the two methods (as demonstrated in the provided code snippet). In each experiment the checkpoints were produced using the CLI. Thanks for your patience with my delayed response.

berryjk commented 1 year ago

@ArturNiederfahrenhorst have you been able to reproduce the discrepancy?

berryjk commented 1 year ago

@ArturNiederfahrenhorst is there any interest from the RLlib team in reproducing this issue?

ArturNiederfahrenhorst commented 1 year ago

Yes, there is high interest, but we are under heavy load at the moment. We will get to this issue as soon as prioritization permits.

berryjk commented 1 year ago

Thanks @ArturNiederfahrenhorst . Let me know how I can help.

ArturNiederfahrenhorst commented 1 year ago

Can you check what happens if you turn off preprocessing for the training? (config.experimental(_disable_preprocessor_api=True))

Since you do...

env_name = "CartPole-v1"
env = gym.make(env_name)

The env that you create is not wrapped in anything. It would be helpful if you could play around with this and see if it changes anything.
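
If you end up building the config through the Python API, that setting goes on the config object, e.g. (a minimal, untested sketch; the remaining settings from your YAML are omitted for brevity):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    # Suggested experiment: bypass RLlib's preprocessor API during training.
    .experimental(_disable_preprocessor_api=True)
)
algo = config.build()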

berryjk commented 1 year ago

@ArturNiederfahrenhorst training is currently performed with a CLI call using the cartpole-ppo.yaml config file provided in the tuned_examples directory. Your comment seems to suggest that I should switch to the Python API to configure and run training, first reproduce the original issue that way with no changes to the config, and then generate results with your proposed change of config.experimental(_disable_preprocessor_api=True) to see if anything changes. Is this correct?
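
For concreteness, the flow I have in mind would look roughly like this (untested sketch, reusing the builder-style config sketched under the YAML above and toggling the experimental flag between runs):

algo = config.build()

# Roughly mirror the YAML's stop criteria.
while True:
    result = algo.train()
    if result["episode_reward_mean"] >= 150 or result["timesteps_total"] >= 100000:
        break

checkpoint_path = algo.save()  # evaluate this checkpoint with both the CLI and the Python API
algo.stop()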

jesuspc commented 11 months ago

In case it helps, I was struggling with an issue that had a similar effect to the one described here, where rewards during training were much higher than the ones I obtained when fetching actions from the trained model, all through the Python API. I tried _disable_preprocessor_api=True with no effect. In particular, I was training a TD3 model:

config = (
    TD3Config()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True, observation_filter="MeanStdFilter")
    .environment(env=MyEnv, env_config=my_env_config)
)

And evaluating with:

model = Algorithm.from_checkpoint(path)
env = MyEnv(my_env_config)
obs, _ = env.reset()
state = None
for i in range(env.total_steps):
    action = model.compute_single_action(obs, state, explore=False)
    obs, _reward, _terminated, _, _ = env.step(action)

Where MyEnv is a custom class inheriting from gym.Env.

I reproduced the issue both when running training with tune and directly with .train(), and both when using the raw model immediately after training and when loading it from a checkpoint file. I've seen it on Ray 2.7.0 and 2.7.1, on both an M2 Mac and Ubuntu 22.04.

In this process I noticed that compute_single_action was not calling any MeanStdFilter instance (verified by adding log statements in rllib/utils/filter.py). I worked around the issue by applying the filter manually before passing the observation to compute_single_action:

# Grab the MeanStdFilter connector from the policy's agent connector pipeline
# (index 1 in this particular pipeline) and apply it to the raw observation.
c = model.get_policy().agent_connectors.connectors[1]
action = model.compute_single_action(c.filter(obs), state, explore=False)

After this the evaluation actions started matching the ones I was getting during training.
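
A slightly more defensive version of the same workaround (untested beyond the pattern above; it assumes only the agent_connectors attributes already used) looks the filter connector up by class name instead of hard-coding index 1:

policy = model.get_policy()

# Find the connector whose class name indicates it wraps the MeanStdFilter,
# instead of assuming it sits at index 1 of the pipeline.
filter_connector = next(
    c for c in policy.agent_connectors.connectors
    if "MeanStd" in type(c).__name__
)

action = model.compute_single_action(filter_connector.filter(obs), state, explore=False)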