n30111 opened this issue 4 months ago (status: Open)
@n30111 Thanks for raising this issue. I could reproduce it; the problem appears to be somewhere in the old API stack. I can run the example without errors using the new stack:
from ray import train, tune
from ray.tune.tuner import Tuner

stopping_criteria = {"training_iteration": 2}

param_space = {
    "env": "LunarLander-v2",
    "env_config": {"continuous": True},
    "enable_rl_module_and_learner": True,
    "enable_env_runner_and_connector_v2": True,
    "kl_coeff": 1.0,
    "num_workers": 0,
    "num_cpus": 0.5,  # number of CPUs to use per trial
    "num_gpus": 0,  # number of GPUs to use per trial
    "lambda": 0.95,
    "clip_param": 0.2,
    "lr": 1e-4,
    "evaluation_interval": 1,
    "evaluation_duration": 6,
    "evaluation_num_env_runners": 1,
}

tuner = Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        metric="env_runners/episode_return_mean",
        mode="max",
        num_samples=1,
    ),
    param_space=param_space,
    run_config=train.RunConfig(stop=stopping_criteria),
)
result_grid = tuner.fit()

res = result_grid._experiment_analysis  # pylint: disable=protected-access
print(res.trials[0].last_result["evaluation"]["env_runners"]["num_episodes"])
assert (
    param_space["evaluation_duration"]
    == res.trials[0].last_result["evaluation"]["env_runners"]["num_episodes"]
)
Maybe this is an alternative for you?
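For reference, the same new-stack setup can also be expressed through the AlgorithmConfig builder API. This is only a sketch of roughly equivalent settings (the per-trial CPU/GPU numbers from the dict above are omitted), not something that was run as part of this thread:

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Sketch: roughly the same settings as the param_space dict above.
config = (
    PPOConfig()
    .environment("LunarLander-v2", env_config={"continuous": True})
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .env_runners(num_env_runners=0)
    .training(lr=1e-4, kl_coeff=1.0, lambda_=0.95, clip_param=0.2)
    .evaluation(
        evaluation_interval=1,
        evaluation_duration=6,
        evaluation_num_env_runners=1,
    )
)

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        metric="env_runners/episode_return_mean", mode="max", num_samples=1
    ),
    param_space=config.to_dict(),
    run_config=train.RunConfig(stop={"training_iteration": 2}),
)
result_grid = tuner.fit()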
We are dependent on the old stack.
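As a stop-gap while staying on the old stack, a minimal sketch (assuming the mismatch is caused by episodes being spread across several evaluation workers, which this thread does not confirm) is to read the actually reported episode count from the trial results rather than asserting it equals evaluation_duration, and to pick a duration that divides evenly by evaluation_num_env_runners. The helper names below are illustrative, not part of Ray:

import math


def rounded_evaluation_duration(desired_episodes: int, num_eval_env_runners: int) -> int:
    """Round the desired episode count up to a multiple of the evaluation workers.

    Assumption (not confirmed in this thread): if episodes are split across
    evaluation workers, an evenly divisible duration may keep the reported
    num_episodes closer to the configured evaluation_duration.
    """
    return math.ceil(desired_episodes / num_eval_env_runners) * num_eval_env_runners


def reported_eval_episodes(result_grid) -> int:
    """Read the actually reported episode count from a finished Tune run."""
    last_result = result_grid._experiment_analysis.trials[0].last_result  # pylint: disable=protected-access
    return last_result["evaluation"]["env_runners"]["num_episodes"]


print(rounded_evaluation_duration(desired_episodes=6, num_eval_env_runners=2))  # -> 6
print(rounded_evaluation_duration(desired_episodes=7, num_eval_env_runners=2))  # -> 8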
What happened + What you expected to happen
When using evaluation_num_env_runners > 1 for RLlib evaluation, the reported results["evaluation"]["env_runners"]["num_episodes"] is not equal to the evaluation_duration set in the configuration.

Versions / Dependencies
Ray 2.31, Python 3.11, Linux
Reproduction script
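The reporter's original script is not included in this copy of the issue. Purely as an illustration, a minimal sketch that should hit the reported path on the old API stack would mirror the example above, but without the new-stack flags and with more than one evaluation env runner; all values below are assumptions, not the reporter's actual settings:

# Illustrative sketch only -- not the reporter's original script.
from ray import train, tune

param_space = {
    "env": "LunarLander-v2",
    "env_config": {"continuous": True},
    # Old API stack: the new-stack enable_* flags are deliberately not set.
    "lr": 1e-4,
    "evaluation_interval": 1,
    "evaluation_duration": 6,
    "evaluation_num_env_runners": 2,  # > 1, which triggers the reported mismatch
}

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(num_samples=1),
    param_space=param_space,
    run_config=train.RunConfig(stop={"training_iteration": 2}),
)
result_grid = tuner.fit()

last_result = result_grid._experiment_analysis.trials[0].last_result  # pylint: disable=protected-access
num_episodes = last_result["evaluation"]["env_runners"]["num_episodes"]
# Reported behavior: num_episodes differs from the configured evaluation_duration.
print(num_episodes, param_space["evaluation_duration"])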
Issue Severity
Medium: It is a significant difficulty but I can work around it.