ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] `policies_to_train` throws incorrect/confusing error message when passed an empty list. #23646

Open nyielding opened 2 years ago

nyielding commented 2 years ago

Search before asking

Ray Component

RLlib

Issue Severity

Medium: It contributes to significant difficulty to complete my task, but I can work around it and get it resolved.

What happened + What you expected to happen

Passing an empty list to config.multiagent.policies_to_train while training with PPO raises an exception with an incorrect/confusing error message.

When all policies are random/heuristic/etc. and none are meant to be trained, according to the config docs an empty list should be passed for "policies_to_train". In this case the error message returned is:

KeyError: '`advantages` not found in SampleBatch for policy `random`! Maybe this policy fails to add advantages in its `postprocess_trajectory` method? Or this policy is not meant to learn at all and you forgot to add it to the list under `config.multiagent.policies_to_train`.'

but intuitively, if "this policy is not meant to learn at all", you should NOT add it to the list under config.multiagent.policies_to_train.

The workaround is to pass a list containing a string that does not match any policy name, i.e. if your policy is called 'random', pass config.multiagent.policies_to_train = ['any_str_but_random'] and the code seems to run as intended.

This could be the empty list being treated the same as None (which defaults to training all policies), but I haven't traced it to that.
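
To illustrate my suspicion (this is only a guess at the failure mode, not the actual RLlib source), a truthiness check like the sketch below would treat an empty list the same as None:

# Hypothetical sketch of the suspected bug, NOT the actual RLlib code:
# `not []` and `not None` are both True, so an empty list would silently
# fall back to "train every policy".
def resolve_policies_to_train(policies_to_train, policies):
    if not policies_to_train:  # buggy: conflates [] with None
        return list(policies)
    return [pid for pid in policies_to_train if pid in policies]

policies = {"random": None}  # stand-in for the configured policy specs
print(resolve_policies_to_train(None, policies))                    # ['random'] -- intended
print(resolve_policies_to_train([], policies))                      # ['random'] -- surprising
print(resolve_policies_to_train(["any_str_but_random"], policies))  # [] -- the workaround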

Similar to previous closed issue https://github.com/ray-project/ray/issues/21044

Versions / Dependencies

ray 1.11.0, Python 3.8.10, Ubuntu 20.04 LTS

Reproduction script

import argparse
import os

import ray
from ray import tune
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.examples.policy.random_policy import RandomPolicy
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="tf",
    help="The DL framework specifier.")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=20,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=100000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=150.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    args = parser.parse_args()

    # Build a 2-agent multi-agent CartPole env and register it with Tune.
    ma_cartpole_cls = make_multi_agent("CartPole-v0")
    ma_cartpole = ma_cartpole_cls({"num_agents": 2})
    register_env("multi_agent", lambda env_config: ma_cartpole)

    act_space = ma_cartpole.action_space
    obs_space = ma_cartpole.observation_space
    print(act_space, obs_space)

    stop = {
        "training_iteration": args.stop_iters,
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
    }

    config = {
        "env": "multi_agent",
        "multiagent": {
            # The only policy is a (non-learning) RandomPolicy.
            "policies": {
                "random": (RandomPolicy, obs_space, act_space, {}),
            },
            "policy_mapping_fn": lambda agent_id, **kwargs: "random",
            # An empty list (i.e. train nothing) triggers the confusing KeyError.
            "policies_to_train": [],
            # "policies_to_train": ["any_str_but_random"],  # try this instead and it works
        },
        "framework": args.framework,
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
    }

    exp_name = "multi_test"
    exp_dict = {
        "name": exp_name,
        "run_or_experiment": "PPO",
        "stop": stop,
        "checkpoint_freq": 20,
        "config": config,
    }
    ray.init()
    tune.run(**exp_dict, fail_fast="raise")

    ray.shutdown()

Anything else

No response

Are you willing to submit a PR?

gjoliver commented 2 years ago

can you clarify what is the use case here? why are we running an RLlib stack without training any policy?

nyielding commented 2 years ago

can you clarify what is the use case here? why are we running an RLlib stack without training any policy?

I'm training agents to do guidance in a GNC environment, and I want to have both trained policies and custom policies that provide heuristic guidance actions via traditional control methods for comparison. It is a multiagent cooperative environment, so for baselines and comparisons I don't want to always mix trained and heuristic policies in the same episodes.

So I would like to be able to run short experiments with the exact same parameters as my training experiment, but with the heuristic 'dummy' policies, so I can get a 1:1 comparison of results from all the custom metrics and recording callbacks I have implemented.
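
For concreteness, here is a rough sketch of the kind of mixed setup I mean, using RLlib's example RandomPolicy as a stand-in for my heuristic controllers (the "learned"/"heuristic" policy names and the even/odd agent split are made up for illustration):

from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.examples.policy.random_policy import RandomPolicy

# Same 2-agent CartPole env as in the reproduction script above.
ma_cartpole = make_multi_agent("CartPole-v0")({"num_agents": 2})
obs_space = ma_cartpole.observation_space
act_space = ma_cartpole.action_space

mixed_multiagent_config = {
    "policies": {
        # Trainable policy (None -> use the trainer's default policy class).
        "learned": (None, obs_space, act_space, {}),
        # Non-learning stand-in for a heuristic/traditional-control policy.
        "heuristic": (RandomPolicy, obs_space, act_space, {}),
    },
    # Even agents use the trained policy, odd agents the heuristic one.
    "policy_mapping_fn": lambda agent_id, **kwargs: (
        "learned" if agent_id % 2 == 0 else "heuristic"
    ),
    # Only the learned policy should be optimized. For a pure-baseline run I
    # would map every agent to "heuristic" and (ideally) pass [] here.
    "policies_to_train": ["learned"],
}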

gjoliver commented 2 years ago

oh, if I understand correctly, is this a use case of "evaluate"? supposedly, we provide rllib/evaluate.py which would do rollout using trained and whatever policies you want. I am not sure if that covers your use case.
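
In case that script does not fit your setup, a rough sketch of doing the same thing programmatically (not rllib/evaluate.py itself, just an illustration): restore a trained PPO Trainer and step the env by hand. The checkpoint path and the "learned"/"heuristic" policy names are placeholders.

import ray
from ray.rllib.agents import ppo
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.examples.policy.random_policy import RandomPolicy
from ray.tune.registry import register_env


def policy_mapping_fn(agent_id, **kwargs):
    # Even agents use the trained policy, odd agents the heuristic one.
    return "learned" if agent_id % 2 == 0 else "heuristic"


ray.init()
env = make_multi_agent("CartPole-v0")({"num_agents": 2})
register_env("multi_agent", lambda _: env)

# The config must match the one used to train the checkpoint.
config = {
    "env": "multi_agent",
    "multiagent": {
        "policies": {
            "learned": (None, env.observation_space, env.action_space, {}),
            "heuristic": (RandomPolicy, env.observation_space, env.action_space, {}),
        },
        "policy_mapping_fn": policy_mapping_fn,
        "policies_to_train": ["learned"],
    },
    "num_gpus": 0,
}
trainer = ppo.PPOTrainer(config=config)
trainer.restore("/path/to/checkpoint")  # placeholder path

# Roll out one episode, routing each agent through its mapped policy.
obs = env.reset()
dones = {"__all__": False}
episode_reward = 0.0
while not dones["__all__"]:
    actions = {
        agent_id: trainer.compute_single_action(
            agent_obs, policy_id=policy_mapping_fn(agent_id))
        for agent_id, agent_obs in obs.items()
        if not dones.get(agent_id, False)
    }
    obs, rewards, dones, _ = env.step(actions)
    episode_reward += sum(rewards.values())
print("episode reward:", episode_reward)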

nyielding commented 2 years ago

oh, if I understand correctly, is this a use case of "evaluate"? supposedly, we provide rllib/evaluate.py which would do rollout using trained and whatever policies you want. I am not sure if that covers your use case.

I suppose you could break it out into an 'evaluate' use case. I have seen the evaluate.py file but admittedly I haven't taken the time/effort to extend it to work with my experiment setup. The way we build up large config files and custom environments doesn't fit neatly and easily into that script, as opposed to just calling tune.run again with a different policy loadout. Building out proper evaluation scripts for my experiments is on my backlog.

But regardless, I think the current behavior is not intended, and the workaround I detailed seems to produce what I would consider the intended behavior. The workaround won't be obvious to anyone else who tries this and runs into the issue, but I can fall back on using it for now.

gjoliver commented 2 years ago

ok, thanks for the clarification. we will keep this in our backlog of things to clean up. agree that the workaround is not good.