ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] `unroll_id` holds only odd numbers? #37643

Open · simonsays1980 opened this issue 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

I sampled from an env using the RolloutWorker with batch_mode="complete_episodes". The returned batches contained only odd numbers in their unroll_id. I don't think a batch is missing; rather, it seems the unroll_id gets incremented twice in the AgentCollector when adding init_obs.

I do not know if this is even intentional, but it makes the user think that some data went missing during sampling.

Debugging pointed me to these two lines (a small inspection snippet follows the list):

  1. policy.agent_connectors(acd_list)
  2. episode.add_init_obs()
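
For reference, a minimal way to watch the counter while debugging, assuming the shared counter is the internal class attribute AgentCollector._next_unroll_id (an implementation detail that may differ across Ray versions):

from ray.rllib.evaluation.collectors.agent_collector import AgentCollector

# Assumption: the shared counter is the class attribute `_next_unroll_id`.
# Printing it before and after a sampling step should show it advancing twice.
print(AgentCollector._next_unroll_id)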

What you expected to happen

That the unroll_id values in the batches are incremented by a single step, i.e. 1, 2, 3, 4, ...

Versions / Dependencies

Ray 2.6.0, Python 3.9.12, Fedora 37

Reproduction script

import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.algorithms.ppo.ppo_tf_policy import PPOTF2Policy
from ray.rllib.evaluation.worker_set import WorkerSet

config = (
    PPOConfig()
    .environment(
        env="CartPole-v1",
    )
    .framework(
        framework="tf2",
    )
    .rollouts(
        rollout_fragment_length=1,
        batch_mode="complete_episodes",
        num_envs_per_worker=2,
        num_rollout_workers=1,
        observation_filter="MeanStdFilter",
    )
    .rl_module(
        _enable_rl_module_api=True,
    )
    .training(
        _enable_learner_api=True,
    )
    .debugging(
        log_level="DEBUG",
    )
)

# Run sampling in evaluation mode.
config.in_evaluation = True

# Build a remote-only WorkerSet to sample from.
workers = WorkerSet(
    env_creator=lambda ctx: gym.make("CartPole-v1"),
    config=config,
    default_policy_class=PPOTF2Policy,
    num_workers=1,
    local_worker=False,
)

# Sample four times from the remote worker and collect the batches.
batches = []
for i in range(4):
    batches.append(workers.foreach_worker(lambda w: w.sample()))

# Print the mean unroll_id of each batch.
for batch in batches:
    print(np.mean(batch[0]["default_policy"]["unroll_id"]))
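
To make the symptom explicit, the unroll IDs across all sampled batches can be collected directly (same batches structure as above):

# Gather every unroll_id seen across the four sample calls.
all_ids = np.concatenate(
    [batch[0]["default_policy"]["unroll_id"] for batch in batches]
)
# Per the report, only odd IDs show up here, e.g. [1, 3, 5, 7].
print(sorted(set(all_ids.tolist())))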

Issue Severity

Low: It annoys or frustrates me.

ArturNiederfahrenhorst commented 1 year ago

I can reproduce and am attempting a fix :) Thanks for reporting this!

ArturNiederfahrenhorst commented 1 year ago

The issue is that the AgentCollector is used in two settings. The unroll ID is stored as a class attribute that is shared between these settings, but only the episodes sampled in the EnvRunnerV2 are returned by the RolloutWorker.
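
A minimal sketch of the mechanism (hypothetical names, not the actual RLlib code): a class-level counter that two code paths advance, where only every second ID lands in a returned batch, looks like this:

class CollectorSketch:
    _next_unroll_id = 0  # class attribute, shared across both settings

    @classmethod
    def next_id(cls):
        uid = cls._next_unroll_id
        cls._next_unroll_id += 1
        return uid

returned = []
for _ in range(4):
    CollectorSketch.next_id()                   # consumed by the other setting
    returned.append(CollectorSketch.next_id())  # lands in the returned batch

print(returned)  # [1, 3, 5, 7] -- only odd IDs are visible to the user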

@simonsays1980 Your expected behaviour is obviously that sampling from the RW should yield consecutive unroll IDs, right? But RLlib does not give guarantees around this, and the IDs are still suitable for disambiguating between unrolls.

@sven1977 Can you advise? Do we want to treat this as a non-issue?

simonsays1980 commented 1 year ago

@ArturNiederfahrenhorst Thanks for the explanation. And good to hear that this does not point to something being off on the user side. I just wondered and dug in a bit.

I wonder whether the unroll_ids even need to be consecutive, or whether they could simply be made unique, like the episode_ids. If they do need to be consecutive, however, missing numbers could make a difference.
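
For illustration, a hedged sketch of the non-consecutive alternative, assuming a random-ID scheme similar to how episode IDs are generated (the actual episode_id generation in RLlib may differ):

import random

def make_unroll_id() -> int:
    # Unique with high probability, but with no ordering guarantee.
    return random.randrange(2**62)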

sven1977 commented 1 year ago

Good catch @simonsays1980, and thanks for the analysis @ArturNiederfahrenhorst. It's probably not a P1, but we should fix this. Maybe we should move this out of the AgentCollector altogether and use it only in the RolloutWorker directly?
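
A rough sketch of that idea (hypothetical names, not a concrete RLlib patch): the worker owns the counter and stamps each batch exactly once after collection:

class WorkerSketch:
    def __init__(self):
        self._next_unroll_id = 0

    def sample(self, collect_batch):
        # The collector no longer touches the counter at all.
        batch = collect_batch()
        batch["unroll_id"] = self._next_unroll_id
        self._next_unroll_id += 1  # exactly one increment per returned batch
        return batch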

ArturNiederfahrenhorst commented 1 year ago

@simonsays1980 We are aiming to rewrite the sampling backend of RLlib to provide more flexibility for writing your own sampling logic. We will keep this in the back of our minds!

ArturNiederfahrenhorst commented 1 year ago

I've created a new label, "rllib-samplingbackend", to track these. Over time, I'll collect issues like this that we should consider during the rewrite.