ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[RLlib] [Connector API] ViewRequirementAgentConnector does not buffer default reward and agent index #39203

Open simonsays1980 opened 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

When running a simple PPO training and then doing inference, the following error occurs:

  File "/home/simon/git-projects/test-rllib/test_rllib/tests/rllib_issue.py", line 88, in <module>
    outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 252, in local_policy_inference
    ac_outputs: List[AgentConnectorsOutput] = policy.agent_connectors(acd_list)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/agent/pipeline.py", line 41, in __call__
    ret = c(ret)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/connector.py", line 265, in __call__
    return [self.transform(d) for d in acd_list]
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/connector.py", line 265, in <listcomp>
    return [self.transform(d) for d in acd_list]
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/agent/view_requirement.py", line 118, in transform
    sample_batch = agent_collector.build_for_inference()
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 373, in build_for_inference
    element_at_t = d[view_req.shift_arr + len(d) - 1]
IndexError: index -1 is out of bounds for axis 0 with size 0

Debugging the code showed that the problem is that the default view_requirements of the PPO policy contain a "prev_rewards" entry with shift_arr=array([0]), while the buffer it reads from in the AgentCollector is still empty at inference time.
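The failing lookup can be illustrated with plain numpy (a minimal sketch with stand-in values, not RLlib code): with an empty buffer, the index computed in agent_collector.py resolves to -1 on an array of size 0, exactly as in the traceback above.

import numpy as np

# Stand-ins for the values observed while debugging (assumed for illustration).
rewards_buffer = np.array([])  # the AgentCollector's reward buffer: still empty
shift_arr = np.array([0])      # shift_arr of the "prev_rewards" view requirement

# Mirrors `d[view_req.shift_arr + len(d) - 1]` in agent_collector.py: the index
# evaluates to array([-1]), which is out of bounds for an empty array.
rewards_buffer[shift_arr + len(rewards_buffer) - 1]
# IndexError: index -1 is out of bounds for axis 0 with size 0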

Nevertheless, is this something the user should have to provide explicitly at inference time? IMO it would be easier to provide it by default, as the user usually does not consider the default view requirements of the policy.

What I expected to happen

That nothing more than the observation needs to be provided for inference with a non-stateful policy.
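In other words, for a non-stateful policy the first call below should be sufficient on its own; explicitly passing the previous reward (as in the commented-out line near the end of the reproduction script) should only be needed as a workaround. A minimal sketch, assuming `policy` has already been restored from a checkpoint as in the script below:

import numpy as np
from ray.rllib.utils.policy import local_policy_inference

obs = np.array([1.0], dtype=np.float32)

# Expected to work on its own for a non-stateful policy: only the observation.
outputs = local_policy_inference(policy, "env_1", "agent_1", obs)

# Current workaround: also pass the previous reward so the connector's reward
# buffer is not empty when `build_for_inference()` indexes into it.
outputs = local_policy_inference(policy, "env_1", "agent_1", obs, reward=-1.0)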

Versions / Dependencies

Linux Fedora 37, Python 3.9.12, Ray nightly (September 1st, 2023, 10:15)

Reproduction script

import gymnasium as gym
import numpy as np

from pathlib import Path
from ray import air, tune
from ray.rllib.algorithms.ppo.ppo import PPO, PPOConfig
from ray.rllib.utils.policy import local_policy_inference

class OneActionRandomObsTwoRewardEnv(gym.Env):
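    """A one-step env: the single action is a no-op, the observation is
    randomly -1.0 or +1.0, and the reward returned by `step()` is +1.0 or
    -1.0 depending on the sign of the observation handed out in `reset()`.
    Every episode terminates after exactly one step."""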

    def __init__(self, config: dict):
        # This has to be a 1-dimensional space as RLlib's Encoder cannot
        # be defined for a zero-dimensional space.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (1,), dtype=np.float32)
        self.action_space = gym.spaces.Box(0.0, 0.0, (), dtype=np.float32)

    def reset(self, *, seed: int = None, options: dict = None):
        state = np.random.choice([-1.0, 1.0], (1,))
        self.reward = 1.0 if state > 0.0 else -1.0

        return state, {}

    def step(self, action: np.ndarray):
        state = np.random.choice([-1.0, 1.0], (1,))
        terminated = True
        truncated = False

        return state, self.reward, terminated, truncated, {}

tune.register_env(
    "test_env", lambda ctx: OneActionRandomObsTwoRewardEnv(ctx)
)

config = (
    PPOConfig()
    .environment(
        env="test_env",
    )
    .framework(
        framework="tf2",
        eager_tracing=True,
    )
    .rollouts(
        rollout_fragment_length=100,
    )
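    # Opt into the new RLModule API (used together with the Learner API flag in `.training()` below).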
    .rl_module(
        _enable_rl_module_api=True,
    )
    .training(
        vf_clip_param=float("inf"),
        train_batch_size=400,
        sgd_minibatch_size=25,
        num_sgd_iter=20,
        _enable_learner_api=True,
        model={
            "fcnet_hiddens": [64, 64],
        },
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=air.RunConfig(
        stop={"training_iteration": 5},
        name="issue_reward_view_req_" + Path(__file__).name.split(".")[0],
    ),
)
result = tuner.fit()
chkpt = result.get_best_result().checkpoint
algo = PPO.from_checkpoint(chkpt)
policy = algo.get_policy()
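# Access the `AgentCollector` buffers of the `ViewRequirementAgentConnector`
# (index 2 of the agent connector pipeline in this setup) to check whether
# the reward gets buffered during inference.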
agent_collector_buffers = policy.agent_connectors.connectors[2].agent_collectors["env_1"]["agent_1"].buffers

print(f"Initial `AgentCollector` buffers: {agent_collector_buffers}")

print("Check obs: -1.0")
obs = np.array([-1.0], dtype=np.float32)

outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
reward_buffer = agent_collector_buffers["rewards"]
print(f"`AgentCollector` buffers: {reward_buffer}")

print("Check obs: 1.0")
obs = np.array([1.0], dtype=np.float32)

outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
# NOTE: The commented code below solves the first exception.
# outputs = local_policy_inference(policy, "env_1", "agent_1", obs, reward = -1.0)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

simonsays1980 commented 1 year ago

@gjoliver As you are fluent in the connectors, what could be the best solution here?

sven1977 commented 1 year ago

Hey @simonsays1980, thanks for opening this issue. This is a good one :) The broader take here should be, imo: