ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] KeyError: 'advantages', when using PPO agent with tune.run on multi agent env #21044

Closed wildsky95 closed 2 years ago

wildsky95 commented 2 years ago

Search before asking

Ray Component

Ray Tune

What happened + What you expected to happen

Hi, I'm trying to use PPO with tune.run on a custom multi-agent environment and I get KeyError: 'advantages'. Is this a bug, or how should I solve it? I tried changing the Ray version and the error appears every time. With the PG agent my custom environment runs correctly, but with the PPO agent I get this error:

```
2021-12-12 21:26:34,243 ERROR trial_runner.py:924 -- Trial PPO_multi_agent_d19e6_00000: Error processing event.
Traceback (most recent call last):
  File "/home/wildsky/Dropbox/NLP_CO/marl_test/test.py", line 93, in <module>
    tune.run(**exp_dict, fail_fast="raise")
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/tune.py", line 607, in run
    runner.step()
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 705, in step
    self._process_events(timeout=timeout)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 866, in _process_events
    self._process_trial(trial)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 893, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 718, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/worker.py", line 1728, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::PPOTrainer.train() (pid=129494, ip=192.168.1.103, repr=PPOTrainer)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/tune/trainable.py", line 315, in train
    result = self.step()
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 929, in step
    raise e
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 911, in step
    result = self.step_attempt()
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 983, in step_attempt
    step_results = next(self.train_exec_impl)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/util/iter.py", line 791, in apply_foreach
    result = fn(item)
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/rllib/execution/rollout_ops.py", line 266, in __call__
    batch[field] = standardized(batch[field])
  File "/home/wildsky/My_Venv/DRL/lib/python3.8/site-packages/ray/rllib/policy/sample_batch.py", line 712, in __getitem__
    value = dict.__getitem__(self, key)
KeyError: 'advantages'
```

Versions / Dependencies

Ray 1.8, 1.9, and 2.0

Reproduction script


```python
import argparse
import os  # needed for the RLLIB_NUM_GPUS lookup in the config below

import ray
from ray import tune
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.examples.policy.random_policy import RandomPolicy
from ray.rllib.policy.policy import PolicySpec
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="tf",
    help="The DL framework specifier.")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=20,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=100000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=150.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    args = parser.parse_args()
    ma_cartpole_cls = make_multi_agent("CartPole-v0")
    ma_cartpole = ma_cartpole_cls({"num_agents": 2})
    register_env("multi_agent", lambda _: ma_cartpole)

    act_space = ma_cartpole.action_space
    obs_space = ma_cartpole.observation_space
    print(act_space, obs_space)

    stop = {
        "training_iteration": args.stop_iters,
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
    }

    config = {
        "env": "multi_agent",
        "multiagent": {
            "policies": {
                "ppo_policy": (None, obs_space, act_space, {}),
                "random": (RandomPolicy, obs_space, act_space, {}),
            },
            "policy_mapping_fn": (
                lambda agent_id, **kwargs: ["ppo_policy", "random"][agent_id % 2]
            ),
            "policies_to_learn": ["ppo_policy"],
        },
        "framework": args.framework,
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    }

    exp_name = 'multi_test'
    exp_dict = {
        'name': exp_name,
        'run_or_experiment': 'PPO',
        "stop": stop,
        'checkpoint_freq': 20,
        "config": config,
        }
    ray.init()
    tune.run(**exp_dict, fail_fast="raise")

    ray.shutdown()
```
sven1977 commented 2 years ago

Hey @wildsky95, this should be an easy fix: your "policies_to_learn" should be "policies_to_train".
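
For reference, a minimal sketch of the corrected config, reusing `obs_space`, `act_space`, `args`, and `RandomPolicy` from the reproduction script above; only the multiagent key name changes:

```python
# Sketch of the corrected config (names taken from the reproduction script).
# Only the key "policies_to_learn" is renamed to "policies_to_train".
config = {
    "env": "multi_agent",
    "multiagent": {
        "policies": {
            "ppo_policy": (None, obs_space, act_space, {}),
            "random": (RandomPolicy, obs_space, act_space, {}),
        },
        "policy_mapping_fn": (
            lambda agent_id, **kwargs: ["ppo_policy", "random"][agent_id % 2]
        ),
        # Only the PPO policy is trained; the random policy just acts.
        "policies_to_train": ["ppo_policy"],
    },
    "framework": args.framework,
    "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
}
```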

To explain the error: Your RandomPolicy does not do postprocessing (it does not have a postprocess_trajectory method defined), so advantages are not calculated for its batches.
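
To make that concrete, here is a small plain-Python sketch (not actual RLlib code) of what the standardize step in rollout_ops effectively does per policy batch: the batch whose policy never ran any postprocessing has no "advantages" column, so indexing it raises exactly the KeyError from the traceback.

```python
# Plain-Python sketch (not RLlib code) of the failure mode: the standardize
# step indexes batch["advantages"] on every train batch it receives, but a
# policy whose postprocessing never ran has no such key in its batch.

def standardize(values):
    # Zero-mean, unit-variance normalization, as applied to the "advantages" field.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std or 1e-8) for v in values]

# The PPO policy's postprocessing computed advantages and added the column.
ppo_batch = {"rewards": [1.0, 1.0, 1.0], "advantages": [0.4, -0.1, 0.2]}

# The RandomPolicy defines no postprocessing, so its batch stays raw.
random_batch = {"rewards": [1.0, 1.0, 1.0]}

for name, batch in {"ppo_policy": ppo_batch, "random": random_batch}.items():
    try:
        batch["advantages"] = standardize(batch["advantages"])
        print(f"{name}: standardized advantages = {batch['advantages']}")
    except KeyError as err:
        print(f"{name}: KeyError {err} -- the error seen in the traceback above")
```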

We should do the following things:

Closing this now; I confirmed your script runs fine after changing "policies_to_learn" to "policies_to_train".

sven1977 commented 2 years ago

@wildsky95 : https://github.com/ray-project/ray/pull/21448

sven1977 commented 2 years ago

Ah, I also just noticed that this is RLlib's fault: the example script has this wrong, but doesn't detect it because it uses PG (which doesn't compute advantages) rather than PPO. Fixed in the above PR. Thanks again!