ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Catch timeout error for IMPALA and improve error message #8301

Closed · vivecalindahlericsson closed this issue 3 years ago

vivecalindahlericsson commented 4 years ago

When running IMPALA with busy scheduling of new simulations (short horizon, many resets), I got the following assertion error:

2020-04-30 16:10:32,923 WARNING worker.py:816 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
2020-04-30 16:16:01,818 ERROR trial_runner.py:512 -- Trial IMPALA_MultiAgentEnv_00000: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 458, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 381, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1526, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::IMPALA.train() (pid=25776, ip=192.168.202.97)
  File "python/ray/_raylet.pyx", line 445, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 423, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 491, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 154, in _train
    fetches = self.optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/async_samples_optimizer.py", line 130, in step
    assert self.learner.is_alive()
AssertionError
Traceback (most recent call last):
  File "test_tune_run.py", line 92, in <module>
    "on_episode_end": on_episode_end,
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 337, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [IMPALA_MultiAgentEnv_00000])

After a significant amount of code inspection and trial-and-error debugging, I realized that the source of the problem is a queue-related timeout that should (I think) raise an error, but for some reason only the assertion error above is surfaced. There is also a trainer config parameter for this timeout, "learner_queue_timeout": 300.
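
For reference, a minimal sketch of how that timeout can be raised via the trainer config when launching the trial with Tune (the environment name and stopping criterion here are placeholders, not from my actual script):

```python
from ray import tune

# Sketch only: raising the learner queue timeout for IMPALA via the trainer
# config. "MyMultiAgentEnv" and the stop condition are placeholders.
tune.run(
    "IMPALA",
    stop={"training_iteration": 100},
    config={
        "env": "MyMultiAgentEnv",
        # Seconds the learner thread waits on its sample queue before timing
        # out (defaults to 300 in the IMPALA config).
        "learner_queue_timeout": 600,
    },
)
```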

It would be helpful to catch the timeout earlier and provide a more helpful error message, e.g. one pointing to the timeout parameter; a rough sketch of what that could look like follows.
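
A possible shape for the fix, sketched against the code paths in the traceback above (the class, attribute, and method names below are assumptions mirroring the config key and traceback, not the exact RLlib source), would be to catch the queue timeout where it occurs and re-raise it with an actionable message, instead of letting the learner thread die and the bare `assert self.learner.is_alive()` fire later:

```python
import queue
import threading


class LearnerThreadSketch(threading.Thread):
    """Sketch of the suggested behaviour; the real RLlib LearnerThread has
    more state. `inqueue` and `learner_queue_timeout` mirror the names used
    in the trainer config and traceback above."""

    def __init__(self, inqueue, learner_queue_timeout=300):
        super().__init__()
        self.inqueue = inqueue
        self.learner_queue_timeout = learner_queue_timeout

    def step(self):
        try:
            batch = self.inqueue.get(timeout=self.learner_queue_timeout)
        except queue.Empty:
            # Surface an actionable error instead of letting the thread die
            # silently, which later trips the opaque AssertionError.
            raise RuntimeError(
                "Learner thread timed out after {}s waiting for sample "
                "batches. If your environment is slow to produce samples "
                "(short horizons, many resets), consider increasing the "
                "`learner_queue_timeout` trainer config.".format(
                    self.learner_queue_timeout))
        return batch
```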

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 3 years ago

Hi again! This issue is being closed because there has been no activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!