ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[rllib] Assert agent_key not in self.agent_collectors #15297

Closed. GiovanniGatti closed this issue 2 years ago.

GiovanniGatti commented 3 years ago

What is the problem?

After a couple of training iterations, the training job crashes with the following error:

 Failure # 1 (occurred at 2021-03-31_11-10-22)
Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::PPO.train_buffered() (pid=264, ip=10.1.0.8)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/opt/miniconda/lib/python3.7/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 526, in train
    raise e
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 515, in train
    result = Trainable.train(self)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 157, in step
    evaluation_metrics = self._evaluate()
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 778, in _evaluate
    for w in self.evaluation_workers.remote_workers()
  File "/opt/miniconda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
ray.exceptions.RayTaskError(AssertionError): ray::RolloutWorker.sample() (pid=375, ip=10.1.0.8)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 662, in sample
    batches = [self.input_reader.next()]
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 95, in next
    batches = [self.get_data()]
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 224, in get_data
    item = next(self.rollout_provider)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 620, in _env_runner
    sample_collector=sample_collector,
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 1198, in _process_observations_w_trajectory_view_api
    new_episode.length - 1, filtered_obs)
  File "/opt/miniconda/lib/python3.7/site-packages/ray/rllib/evaluation/collectors/simple_list_collector.py", line 487, in add_init_obs
    assert agent_key not in self.agent_collectors
AssertionError

Python version: 3.7.9
OS: Ubuntu 18.04
TensorFlow version: 2.4.1
Docker: version 20.10.3, build 48d30b5

Logs: error.txt 70_driver_log.txt

I suspect that the error is caused by the following code:

# episode.py

class MultiAgentEpisode:
    # ...

    def __init__(self, policies: Dict[PolicyID, Policy],
                 policy_mapping_fn: Callable[[AgentID], PolicyID],
                 batch_builder_factory: Callable[
                     [], "MultiAgentSampleBatchBuilder"],
                 extra_batch_callback: Callable[[SampleBatchType], None],
                 env_id: EnvID):
        #...
        self.episode_id: int = random.randrange(2e9)
        #...

I'm using a training batch that generates ~2k episodes/iteration. Assuming that episode ids are independent per training iteration, the job has (if I'm not wrong in my calculations) about a ~20% probability of generating at least one conflicting id in the first 100 training iterations, due to the birthday paradox.
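Roughly, the arithmetic I have in mind (just a back-of-the-envelope sketch, assuming ids only need to be unique among episodes collected within the same iteration; the exact number depends on how long RLlib keeps episode ids around):

import math

ID_SPACE = 2_000_000_000        # range used by random.randrange(2e9)
EPISODES_PER_ITER = 2_000       # ~2k episodes/iteration, as above
ITERATIONS = 100

# Birthday-paradox approximation:
# P(collision among n draws from d values) ~= 1 - exp(-n * (n - 1) / (2 * d))
p_iter = 1 - math.exp(-EPISODES_PER_ITER * (EPISODES_PER_ITER - 1) / (2 * ID_SPACE))

# Chance of at least one conflicting id across all iterations
p_total = 1 - (1 - p_iter) ** ITERATIONS
print(f"per iteration: {p_iter:.4%}, over {ITERATIONS} iterations: {p_total:.1%}")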

Reproduction

I was able to reproduce the error with the following script:

import ray
from ray import tune

def stop(trial_id, result):
    return result['training_iteration'] >= 1000

DEFAULT_RAY_ADDRESS = 'localhost:6379'

if __name__ == '__main__':
    horizon = 20
    num_workers = 5
    num_envs_per_worker = 128
    trainer_cpus = 11
    eval_workers = 0
    num_eval_ep = 0
    num_episodes = 5 * 3 * num_envs_per_worker * horizon
    train_batch_size = num_episodes * horizon
    sgd_minibatch_size = 1024
    num_sgd_iter = 1
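    # Note: with horizon == rollout_fragment_length, each episode lasts `horizon`
    # steps, so a single training iteration collects roughly
    # train_batch_size / horizon (= num_episodes) fresh episode ids.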
    ray.init(address=DEFAULT_RAY_ADDRESS)

    tune.run(
        run_or_experiment='PPO',
        config={
            # Training settings
            'env': 'Pendulum-v0',
            'model': {
                "fcnet_hiddens": [16]
            },
            'env_config': {},
            'num_workers': num_workers,
            'num_cpus_per_worker': 1,
            'num_envs_per_worker': num_envs_per_worker,
            'rollout_fragment_length': horizon,
            'framework': 'tf2',

            # Continuous Task settings
            'horizon': horizon,
            'soft_horizon': False,
            'no_done_at_end': True,

            # Parallel Training CPU
            'num_cpus_for_driver': trainer_cpus,
            'tf_session_args': {
                'intra_op_parallelism_threads': 0,
                'inter_op_parallelism_threads': 0,
                'log_device_placement': False,
                'device_count': {
                    'CPU': trainer_cpus,
                },
                'allow_soft_placement': True,
            },
            'local_tf_session_args': {
                'intra_op_parallelism_threads': 0,
                'inter_op_parallelism_threads': 0,
            },

            # PPO specific
            'train_batch_size': train_batch_size,
            'sgd_minibatch_size': sgd_minibatch_size,
            'num_sgd_iter': num_sgd_iter,
        },
        stop=stop,
        local_dir='./logs')

GiovanniGatti commented 3 years ago

FYI, this issue is being redirected from https://discuss.ray.io/t/assert-agent-key-not-in-self-agent-collectors/1489/6.

GiovanniGatti commented 3 years ago

The exception is also raised with Ray 1.1.0. Logs: 70_driver_log.txt

stefanbschneider commented 3 years ago

I'm running into the same error when using PPO for multi-agent RL (shared policy) with an increasing and then decreasing number of agents over time. For me, the error only occurs with Ray 1.3 but not with version 1.0. It also does not occur if I have a constant number of agents (or an only increasing, never decreasing, number).

Maybe this already helps a bit to narrow down the bug. If I figure out more details, I'll add them here.


It's a bit difficult to debug; even when running with a debugger and setting local_mode=True in ray.init(), the debugger only stops at the TuneError "Trials did not complete", not at the initial assertion error that caused the problem. So I can't see the values that trigger this assertion error. Is there any way to debug this effectively?

stefanbschneider commented 3 years ago

I think the error somehow comes from my environment removing agents during an episode. What I currently do is simply remove the corresponding observation, info, done, and reward entries for the "removed" agent. Do I need to actively remove/de-register the agent from RLlib somehow? @sven1977 Any idea?
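To illustrate, this is roughly what my step() does when an agent is removed (a stripped-down sketch with made-up names, not my actual environment):

import numpy as np
import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class ShrinkingAgentsEnv(MultiAgentEnv):
    # Toy sketch: one agent disappears halfway through the episode.
    def __init__(self, config=None):
        super().__init__()
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.active_agents = {"agent_0", "agent_1"}
        self.t = 0

    def reset(self):
        self.active_agents = {"agent_0", "agent_1"}
        self.t = 0
        return {a: np.zeros(1, dtype=np.float32) for a in self.active_agents}

    def step(self, action_dict):
        self.t += 1
        if self.t == 5:
            # "Remove" agent_1 by simply no longer returning its keys.
            # Note: no done=True is emitted for the removed agent here.
            self.active_agents.discard("agent_1")
        obs = {a: np.full(1, self.t, dtype=np.float32) for a in self.active_agents}
        rewards = {a: 0.0 for a in self.active_agents}
        dones = {a: False for a in self.active_agents}
        dones["__all__"] = self.t >= 10
        infos = {a: {} for a in self.active_agents}
        return obs, rewards, dones, infos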

For now, I'll roll back to v1.0 to temporarily fix the problem for me.

JSchapke commented 3 years ago

I have also been having the same error. As far as I can see, it seems to occur when the length of the environment's episode is smaller than the 'rollout_fragment_length' parameter of algorithms such as PPO. Whenever this happens, it is possible that a new episode starts with the same episode_id as the previous one (here, I refer to the 'episode_id' attribute of the Episode object that gets passed to the add_init_obs function). The agent_key variable, which is a tuple of the policy_id and episode_id, is then the same as in the previous episode and causes the error.
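In other words, something like the following happens (a toy mirror of that bookkeeping, based on the description above rather than RLlib's actual code, which may differ in detail):

class ToyCollector:
    # Minimal stand-in for the per-agent bookkeeping in simple_list_collector.py.
    def __init__(self):
        self.agent_collectors = {}  # keyed by an agent_key tuple

    def add_init_obs(self, policy_id, episode_id, obs):
        agent_key = (policy_id, episode_id)  # key composition as described above
        # The assertion that fires in the traceback: an entry for a *previous*
        # episode that happened to use the same episode_id is still registered.
        assert agent_key not in self.agent_collectors
        self.agent_collectors[agent_key] = [obs]

collector = ToyCollector()
collector.add_init_obs("default_policy", episode_id=12345, obs=0.0)
# A short episode ends, a new one starts and reuses the same id before the
# old entry has been cleaned up, so the same key is built again:
collector.add_init_obs("default_policy", episode_id=12345, obs=1.0)  # AssertionError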

GiovanniGatti commented 3 years ago

As I wrote before, I suspect that the issue is caused by the birthday paradox. Using self.episode_id: int = random.randrange(2e9) to generate ids gives far too small a range. The workaround I implemented (which has been working fairly well for 2 months now) was to reduce the number of episodes generated at each training iteration.

If my theory is correct, it also explains the randomness of the error. The conflicting ids are a question of probability. It may never occur if the training is too short or the number of episodes per training iteration is too small. The only way to reliably reproduce it is to run a very long training with a lot of episodes in each training iteration.

If it turns out that this is a birthday paradox issue, I would suggest using a reasonably long hash, e.g. generated with MurmurHash, for episode ids.
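For illustration, drawing ids from a larger space would make collisions astronomically unlikely (just a sketch of possible alternatives, not what RLlib ships; murmur would need a third-party package, but the stdlib options below make the same point):

import random
import uuid

# What episode.py currently does: ids from a space of only ~2e9 values.
small_id = random.randrange(int(2e9))

# Possible alternatives with a much larger id space:
id_64 = random.getrandbits(64)   # ~1.8e19 possible values
id_128 = uuid.uuid4().int        # 128-bit random id
print(small_id, id_64, id_128)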

stefanbschneider commented 3 years ago

@GiovanniGatti In my multi-agent scenario with a first increasing, then decreasing number of active agents, the error is reproducible, so it's not (just) due to random IDs from too small a range. In my case, there seems to be another issue.

@JSchapke Is it possible that this is somehow triggered by suspending previously active RL agents?

moamenibrahim commented 3 years ago

Has this been solved or worked on? I am facing the same problem now in a multi-agent setup, but in the add_action_reward_next_obs() function, line 524, with the assertion assert agent_key in self.agent_collectors, and it happens basically in the second training iteration.

Error stack:

File "\venv\lib\site-packages\ray\_private\function_manager.py", line 556, in actor_method_executor
  return method(__ray_actor, *args, **kwargs)
File "\venv\lib\site-packages\ray\util\iter.py", line 1158, in par_iter_next_batch
  batch.append(self.par_iter_next())
File "\venv\lib\site-packages\ray\util\iter.py", line 1152, in par_iter_next
  return next(self.local_it)
File "\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 332, in gen_rollouts
  yield self.sample()
File "\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 720, in sample
  batch = self.input_reader.next()
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 96, in next
  batches = [self.get_data()]
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 223, in get_data
  item = next(self.rollout_provider)
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 596, in _env_runner
  _process_observations(
File "\venv\lib\site-packages\ray\rllib\evaluation\sampler.py", line 852, in _process_observations
  sample_collector.add_action_reward_next_obs(
File "\venv\lib\site-packages\ray\rllib\evaluation\collectors\simple_list_collector.py", line 524, in add_action_reward_next_obs
  assert agent_key in self.agent_collectors
AssertionError

vonHartz commented 3 years ago

I stumbled over the problem as well when trying to run the SAC Halfcheetah config from the tuned examples in the repo. Happened seemingly randomly after 762 iters.

rollout_fragment_length is 1, which speaks against your hypothesis, @JSchapke, and in favor of @GiovanniGatti's theory.

I'm running Ray 1.3.0, so this might or might not have been fixed by now.

GiovanniGatti commented 3 years ago

@vonHartz, I've been experiencing this issue regularly in Ray 1.4.0. I don't think it has been fixed by now because I don't see anything addressing it in the release notes.

As a workaround, one can do the following: if the job fails because of assert agent_key not in self.agent_collectors, relaunch the training job from the latest checkpoint. In Python:

import ray
from ray import tune
from ray.tune import experiment_analysis

# run_training() is the user's own wrapper around tune.run() that detects and
# restores the latest checkpoint if one exists.
ray.init()
analysis = run_training()
while True:
    relaunch = False
    incomplete_trials = []
    for trial in analysis.trials:
        if trial.status == experiment_analysis.Trial.ERROR:
            if 'assert agent_key not in self.agent_collectors' in trial.error_msg:
                # The known spurious failure: simply relaunch from the checkpoint.
                relaunch = True
            else:
                # Any other error is treated as a real failure.
                incomplete_trials.append(trial)

    if incomplete_trials:
        raise tune.TuneError("Trials did not complete", incomplete_trials)

    if relaunch:
        # Here you can automatically detect and load the latest checkpoint with tune.
        analysis = run_training()
        continue
    break

I'm not proud of it, but it works. In the worst case scenario, if you checkpoint regularly, you just lose a couple of training iterations.

ekblad commented 3 years ago

I am running into this traceback with IMPALA after exactly 500,000 episodes, independent of episode length. I am running 100 workers, and weirdly, this issue only popped up after I went from 250 to 500 envs per worker.

edit: I was running >500k episodes with the 100 workers/250 envs per worker config.

stefanbschneider commented 2 years ago

As I wrote before, I suspect that the issue is caused by the birthday paradox. Using self.episode_id: int = random.randrange(2e9) to generate ids gives far too small a range. The workaround I implemented (which has been working fairly well for 2 months now) was to reduce the number of episodes generated at each training iteration.

If my theory is correct, it also explains the randomness of the error. The conflicting ids are a question of probability. It may never occur if the training is too short or the number of episodes per training iteration is too small. The only way to reliably reproduce it is to run a very long training with a lot of episodes in each training iteration.

If it turns out that this is a birthday paradox issue, I would suggest using a reasonably long hash, e.g. generated with MurmurHash, for episode ids.

I still have this issue every now and then in Ray 1.8. Since it only happens sporadically, I suspect @GiovanniGatti is right (quoted above). For now, I simply rerun the broken training runs, but it would be great if there were a fix. I'll try to upgrade to Ray 1.12 and see if the issue persists.

matteobettini commented 2 years ago

Hello,

I ran into the same issue in Ray 1.12.1, so it has not been fixed. For me, it happens with just a single-agent environment.