What happened + What you expected to happen

What happened

When running a simple PPO training and then doing inference the following error occurs:

  File "/home/simon/git-projects/test-rllib/test_rllib/tests/rllib_issue.py", line 88, in <module>
    outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 252, in local_policy_inference
    ac_outputs: List[AgentConnectorsOutput] = policy.agent_connectors(acd_list)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/agent/pipeline.py", line 41, in __call__
    ret = c(ret)
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/connector.py", line 265, in __call__
    return [self.transform(d) for d in acd_list]
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/connector.py", line 265, in <listcomp>
    return [self.transform(d) for d in acd_list]
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/connectors/agent/view_requirement.py", line 118, in transform
    sample_batch = agent_collector.build_for_inference()
  File "/home/simon/git-projects/test-rllib/.venv-nightly/lib/python3.9/site-packages/ray/rllib/evaluation/collectors/agent_collector.py", line 373, in build_for_inference
    element_at_t = d[view_req.shift_arr + len(d) - 1]
IndexError: index -1 is out of bounds for axis 0 with size 0

Debugging the code showed that the problem is that the default view_requirements of the PPO policy contains the "prev_rewards" with shift_arr=array([0]):

On the first inference call the agent_collector is empty and gets initialized; specifically the "rewards" get added via the get_dummy_batch_for_space() in the add_init_obs() method of the collector. This is what you see in the example code below in the first print out of the buffers["rewards"]
On the second inference call, however, the add_action_reward_next_obs() method fo the collector gets called and does not ensure the "prev_reward" to be present.
This can be solved by providing the reward in the local_policy_inference() function (see comments), but then the same problem occurs with the "agent_id".
This can be solved by adding the "agent_index" to the data batch in the local_policy_inference() - which is already provided by the user in the call, but it needs a modification of this function.

Nevertheless, is this something we should provide in inference? Imo it would be easier to provide this then on default as the user does usually not consider the default view requirements of the policy.

What I expected to happen

That not more then the obs are needed to be provided in inference of a non-stateful policy.

Versions / Dependencies

Linux Fedora 37 Python 3.9.12 Ray nightly September 1st 2023 10:15

Reproduction script

import gymnasium as gym
import numpy as np

from pathlib import Path
from ray import air, tune
from ray.rllib.algorithms.ppo.ppo import PPO, PPOConfig
from ray.rllib.utils.policy import local_policy_inference

class OneActionRandomObsTwoRewardEnv(gym.Env):

    def __init__(self, config: dict):
        # This has to be a 1-dimensional space as RLlib's Encoder cannot
        # be defined for a zero-dimensional space.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (1,), dtype=np.float32)
        self.action_space = gym.spaces.Box(0.0, 0.0, (), dtype=np.float32)

    def reset(self, *, seed: int = None, options: dict = None):
        state = np.random.choice([-1.0, 1.0], (1,))
        self.reward = 1.0 if state > 0.0 else -1.0

        return state, {}

    def step(self, action: np.ndarray):
        state = np.random.choice([-1.0, 1.0], (1,))
        terminated = True
        truncated = False

        return state, self.reward, terminated, truncated, {}

tune.register_env(
    "test_env", lambda ctx: OneActionRandomObsTwoRewardEnv(ctx)
)

config = (
    PPOConfig()
    .environment(
        env="test_env",
    )
    .framework(
        framework="tf2",
        eager_tracing=True,
    )
    .rollouts(
        rollout_fragment_length=100,
    )
    .rl_module(
        _enable_rl_module_api=True,
    )
    .training(
        vf_clip_param=float("inf"),
        train_batch_size=400,
        sgd_minibatch_size=25,
        num_sgd_iter=20,
        _enable_learner_api=True,
        model={
            "fcnet_hiddens": [64, 64],
        },
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=air.RunConfig(
        stop={"training_iteration": 5},
        name="issue_reward_view_req_" + Path(__file__).name.split(".")[0],
    ),
)
result = tuner.fit()
chkpt = result.get_best_result().checkpoint
algo = PPO.from_checkpoint(chkpt)
policy = algo.get_policy()
agent_collector_buffers = policy.agent_connectors.connectors[2].agent_collectors["env_1"]["agent_1"].buffers

print(f"Initial `AgentCollector` buffers: {agent_collector_buffers}")

print("Check obs: -1.0")
obs = np.array([-1.0], dtype=np.float32)

outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
reward_buffer = agent_collector_buffers["rewards"]
print(f"`AgentCollector` buffers: {reward_buffer}")

print("Check obs: 1.0")
obs = np.array([1.0], dtype=np.float32)

outputs = local_policy_inference(policy, "env_1", "agent_1", obs)
# NOTE: The commented code below solves the first exception.
# outputs = local_policy_inference(policy, "env_1", "agent_1", obs, reward = -1.0)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Hey @simonsays1980 , thanks for opening this issue. This is a good one :) The broader take here should be, imo:

Move PPO from RolloutWorker/Policy API to the new EnvRunner (replaces RolloutWorker) + MARLModule (replaces PolicyMap) APIs. See DreamerV3's (albeit single-agent only) EnvRunner under algorithms.dreamerv3.utils.env_runner.py
Make the new EnvRunner: Use gymnasium.vector as the environment API (get rid of RLlib's own quirky Env APIs; again, use DreamerV3 as example) Use Connectors directly in the EnvRunner (<- this could be a phase II). Use DreamerV3's Episode class to store data temporarily (this makes data easily accessible by EnvRunner for compute action calls: forward_exploration/inference()) Pass data from ongoing Episode through Connectors and into RLModules for action computation. The user might configure a custom function that allows them to extract the "correct" data from the Episode given some timestep. This way, we can solve (and get rid of) the conundrum of the TrajectoryViewAPI via a simpler yet more powerful functional API. For example, should the user know that her model requires the last 10 rewards besides the observation, she can write a custom function to extract those data from the ongoing Episode object (and use 0-padding or any other solution for episode-edge cases). (<- this could be phase II) The same happens on the way back to the env: EnvRunner will use the EnvConnector to pass the computed action back to the environment. ** Maybe: Should the module return something from its get_internal_state() method, the EnvRunner might automatically handle RNN-state passing into the module's forward methods as well as storing the most recent state for the next call. Again, see DreamerV3's EnvRunner for a working example of such behavior. (<- this could be phase II; phase I w/o LSTM support)

ray-project / ray

[RLlib] [Connector API]ViewRequirementAgentConnector does not buffer default reward and agent index #39203