ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] `fail_fast` marks all runs as terminated, making the experiment impossible to restore #30584

Open aRI0U opened 1 year ago

aRI0U commented 1 year ago

What happened + What you expected to happen

I'm currently tuning a model with Ray Tune and Optuna, but the cluster I'm working on has issues and sometimes runs out of memory (OOM), which causes all my current and subsequent runs to fail. I therefore added `failure_config=air.FailureConfig(fail_fast=True)` so that the first OOM stops the experiment and I can resume it later. However, when the experiment stops, all running and pending runs are marked as terminated, so restoring the Tuner does not resume them.

Here is the end of the output I get with the example script:

== Status ==
Current time: 2022-11-22 15:02:29 (running for 00:00:58.16)
Memory usage on this node: 9.4/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 48.0/48 CPUs, 4.0/8 GPUs, 0.0/339.25 GiB heap, 0.0/149.38 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (6 PENDING, 4 RUNNING)
+-------------------------+----------+------------------+---------+--------+------------------+---------+
| Trial name              | status   | loc              |     val |   iter |   total time (s) |     val |
|-------------------------+----------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00000 | RUNNING  | 10.0.7.31:881285 | 19.0407 |      5 |          56.2074 | 3.80815 |
| train_dummy_2a005_00001 | RUNNING  | 10.0.7.31:881510 | 18.9348 |      4 |          52.2111 | 4.7337  |
| train_dummy_2a005_00002 | RUNNING  | 10.0.7.31:881512 | 18.4179 |      4 |          49.2225 | 4.60448 |
| train_dummy_2a005_00003 | RUNNING  | 10.0.7.31:881514 | 10.1868 |      6 |          48.223  | 1.69781 |
| train_dummy_2a005_00004 | PENDING  |                  | 10.3313 |        |                  |         |
| train_dummy_2a005_00005 | PENDING  |                  | 12.4223 |        |                  |         |
| train_dummy_2a005_00006 | PENDING  |                  | 16.8942 |        |                  |         |
| train_dummy_2a005_00007 | PENDING  |                  | 19.8347 |        |                  |         |
| train_dummy_2a005_00008 | PENDING  |                  | 17.7972 |        |                  |         |
| train_dummy_2a005_00009 | PENDING  |                  | 10.8272 |        |                  |         |
+-------------------------+----------+------------------+---------+--------+------------------+---------+

2022-11-22 15:02:29,330 ERROR trial_runner.py:993 -- Trial train_dummy_2a005_00000: Error processing event.
ray.exceptions.RayTaskError(IndexError): ray::ImplicitFunc.train() (pid=881285, ip=10.0.7.31, repr=train_dummy)
  File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 325, in entrypoint
    return self._trainable_func(
  File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 651, in _trainable_func
    output = fn()
  File "/home/alain/code/teacher-student/test_restore.py", line 12, in train_dummy
    raise IndexError("Dummy error for the example")  
IndexError: Dummy error for the example
Result for train_dummy_2a005_00000:
  date: 2022-11-22_15-02-29
  done: false
  experiment_id: 8f09245de949441fad34a02ba982eb6b
  experiment_tag: 0_val=19.0407
  hostname: bach
  iterations_since_restore: 5
  node_ip: 10.0.7.31
  pid: 881285
  time_since_restore: 56.207355260849
  time_this_iter_s: 10.021783113479614
  time_total_s: 56.207355260849
  timestamp: 1669125749
  timesteps_since_restore: 0
  training_iteration: 5
  trial_id: 2a005_00000
  val: 3.8081459750683044
  warmup_time: 0.0031404495239257812

== Status ==
Current time: 2022-11-22 15:02:29 (running for 00:00:58.37)
Memory usage on this node: 9.3/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/8 GPUs, 0.0/339.25 GiB heap, 0.0/149.38 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (1 ERROR, 9 TERMINATED)
+-------------------------+------------+------------------+---------+--------+------------------+---------+
| Trial name              | status     | loc              |     val |   iter |   total time (s) |     val |
|-------------------------+------------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00001 | TERMINATED | 10.0.7.31:881510 | 18.9348 |      4 |          52.2111 | 4.7337  |
| train_dummy_2a005_00002 | TERMINATED | 10.0.7.31:881512 | 18.4179 |      4 |          49.2225 | 4.60448 |
| train_dummy_2a005_00003 | TERMINATED | 10.0.7.31:881514 | 10.1868 |      6 |          48.223  | 1.69781 |
| train_dummy_2a005_00004 | TERMINATED |                  | 10.3313 |        |                  |         |
| train_dummy_2a005_00005 | TERMINATED |                  | 12.4223 |        |                  |         |
| train_dummy_2a005_00006 | TERMINATED |                  | 16.8942 |        |                  |         |
| train_dummy_2a005_00007 | TERMINATED |                  | 19.8347 |        |                  |         |
| train_dummy_2a005_00008 | TERMINATED |                  | 17.7972 |        |                  |         |
| train_dummy_2a005_00009 | TERMINATED |                  | 10.8272 |        |                  |         |
| train_dummy_2a005_00000 | ERROR      | 10.0.7.31:881285 | 19.0407 |      5 |          56.2074 | 3.80815 |
+-------------------------+------------+------------------+---------+--------+------------------+---------+
Number of errored trials: 1
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name              |   # failures | error file                                                                                                                                        |
|-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------|
| train_dummy_2a005_00000 |            1 | /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00000_0_val=19.0407_2022-11-22_15-01-31/error.txt |
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

(train_dummy pid=881510) 2022-11-22 15:02:29,380        ERROR worker.py:763 -- Worker exits with an exit code 1.
(train_dummy pid=881510) Traceback (most recent call last):
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 1032, in ray._raylet.task_execution_handler
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 812, in ray._raylet.execute_task
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 852, in ray._raylet.execute_task
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(train_dummy pid=881510)   File "python/ray/_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
(train_dummy pid=881510)     return method(__ray_actor, *args, **kwargs)
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(train_dummy pid=881510)     return method(self, *_args, **_kwargs)
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 352, in train
(train_dummy pid=881510)     result = self.step()
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(train_dummy pid=881510)     return method(self, *_args, **_kwargs)
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 365, in step
(train_dummy pid=881510)     result = self._results_queue.get(
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/queue.py", line 180, in get
(train_dummy pid=881510)     self.not_empty.wait(remaining)
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/threading.py", line 324, in wait
(train_dummy pid=881510)     gotit = waiter.acquire(True, timeout)
(train_dummy pid=881510)   File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/_private/worker.py", line 760, in sigterm_handler
(train_dummy pid=881510)     sys.exit(1)
(train_dummy pid=881510) SystemExit: 1
2022-11-22 15:02:29,567 ERROR tune.py:773 -- Trials did not complete: [train_dummy_2a005_00000]
2022-11-22 15:02:29,567 INFO tune.py:777 -- Total run time: 59.38 seconds (58.36 seconds for the tuning loop).
Best trial:
Result(metrics={'val': 1.6978080140891973, 'done': False, 'trial_id': '2a005_00003', 'experiment_tag': '3_val=10.1868'}, error=None, log_dir=PosixPath('/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00003_3_val=10.1868_2022-11-22_15-01-32'))
Best trial config:
{'val': 10.186848084535184}

Then, here is what I get when trying to restore the Tuner:

2022-11-22 15:03:05,918 INFO worker.py:1528 -- Started a local Ray instance.
2022-11-22 15:03:07,547 WARNING function_trainable.py:586 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2022-11-22 15:03:08,351 INFO trial_runner.py:601 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.
2022-11-22 15:03:08,352 INFO trial_runner.py:738 -- Using following checkpoint to resume: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/experiment_state-2022-11-22_15-01-30.json
2022-11-22 15:03:08,352 WARNING trial_runner.py:743 -- Attempting to resume experiment from /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26. This will ignore any new changes to the specification.
2022-11-22 15:03:08,375 INFO tune.py:668 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
== Status ==
Current time: 2022-11-22 15:03:08 (running for 00:00:00.01)
Memory usage on this node: 9.0/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/8 GPUs, 0.0/341.66 GiB heap, 0.0/150.42 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (1 ERROR, 9 TERMINATED)
+-------------------------+------------+------------------+---------+--------+------------------+---------+
| Trial name              | status     | loc              |     val |   iter |   total time (s) |     val |
|-------------------------+------------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00001 | TERMINATED | 10.0.7.31:881510 | 18.9348 |      4 |          52.2111 | 4.7337  |
| train_dummy_2a005_00002 | TERMINATED | 10.0.7.31:881512 | 18.4179 |      4 |          49.2225 | 4.60448 |
| train_dummy_2a005_00003 | TERMINATED | 10.0.7.31:881514 | 10.1868 |      6 |          48.223  | 1.69781 |
| train_dummy_2a005_00004 | TERMINATED |                  | 10.3313 |        |                  |         |
| train_dummy_2a005_00008 | TERMINATED |                  | 17.7972 |        |                  |         |
| train_dummy_2a005_00007 | TERMINATED |                  | 19.8347 |        |                  |         |
| train_dummy_2a005_00005 | TERMINATED |                  | 12.4223 |        |                  |         |
| train_dummy_2a005_00006 | TERMINATED |                  | 16.8942 |        |                  |         |
| train_dummy_2a005_00009 | TERMINATED |                  | 10.8272 |        |                  |         |
| train_dummy_2a005_00000 | ERROR      | 10.0.7.31:881285 | 19.0407 |      5 |          56.2074 | 3.80815 |
+-------------------------+------------+------------------+---------+--------+------------------+---------+
Number of errored trials: 1
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name              |   # failures | error file                                                                                                                                        |
|-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------|
| train_dummy_2a005_00000 |            1 | /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00000_0_val=19.0407_2022-11-22_15-01-31/error.txt |
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+

2022-11-22 15:03:08,495 ERROR tune.py:773 -- Trials did not complete: [train_dummy_2a005_00000]
2022-11-22 15:03:08,495 INFO tune.py:777 -- Total run time: 0.95 seconds (0.00 seconds for the tuning loop).
Best trial:
Result(metrics={'val': 1.6978080140891973, 'done': False, 'trial_id': '2a005_00003', 'experiment_tag': '3_val=10.1868'}, error=None, log_dir=PosixPath('/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00003_3_val=10.1868_2022-11-22_15-01-32'))
Best trial config:
{'val': 10.186848084535184}

I think it would make more sense to let the user resume the experiment. Otherwise, is there a way to use the results from an interrupted experiment in a new one, e.g. as prior knowledge for a hyperparameter search?
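On the question of reusing results, one possible approach is to harvest the per-trial result files from the interrupted experiment directory and hand them to a search algorithm as already-evaluated points. The sketch below is not part of the Ray API: `collect_prior_results` is a hypothetical helper, and it assumes each trial directory contains a `params.json` with the trial config and a `result.json` with one JSON object per reported result. That layout should be checked against the Ray version in use.

```python
import json
from pathlib import Path


def collect_prior_results(experiment_dir, metric):
    """Gather (config, last reported metric) pairs from trial directories.

    Hypothetical helper. Assumes every trial directory holds `params.json`
    (the trial config) and `result.json` (one JSON object per line, one line
    per reported result). Trials that never reported anything are skipped.
    """
    points, rewards = [], []
    for result_file in sorted(Path(experiment_dir).glob("*/result.json")):
        params_file = result_file.parent / "params.json"
        if not params_file.exists():
            continue
        lines = [ln for ln in result_file.read_text().splitlines() if ln.strip()]
        if not lines:
            continue  # no result was reported, e.g. the trial never started
        last = json.loads(lines[-1])
        if metric not in last:
            continue
        points.append(json.loads(params_file.read_text()))
        rewards.append(last[metric])
    return points, rewards


if __name__ == "__main__":
    # The path is a placeholder for an interrupted experiment directory.
    points, rewards = collect_prior_results("test-restore/train_dummy_<timestamp>", "val")
    print(len(points), "finished or partially finished trials found")
```

The two lists could then be passed to a searcher that accepts previously evaluated points; `OptunaSearch`, for instance, takes `points_to_evaluate` and `evaluated_rewards` arguments. Note that interrupted trials stopped at different iterations, so their last reported values may not be directly comparable.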

Versions / Dependencies

Reproduction script

import random
import time

import ray.air as air
import ray.tune as tune

def train_dummy(config):
    v = config["val"]
    for i in range(1, 30):
        if random.random() < 0.05:
            raise IndexError("Dummy error for the example")
        time.sleep(random.randint(5, 15))
        tune.report(val=v / i)

def train_dummy_models():
    config = {'val': tune.uniform(10, 20)}

    tuner = tune.Tuner(
        tune.with_resources(
            train_dummy,
            resources={"cpu": 12, "gpu": 1}
        ),
        tune_config=tune.TuneConfig(
            num_samples=10,
            metric="val",
            mode="min"
        ),
        run_config=air.RunConfig(
            local_dir="test-restore",
            failure_config=air.FailureConfig(fail_fast=True)
        ),
        param_space=config
    )

    # On the first run, comment out the line below; when restoring, replace the path with your own experiment directory (including its timestamp).
    tuner = tune.Tuner.restore(path="/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26")
    results = tuner.fit()

    best_trial = results.get_best_result()
    print("Best trial:")
    print(best_trial)
    print("Best trial config:")
    print(best_trial.config)

if __name__ == "__main__":
    train_dummy_models()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

paehal commented 1 year ago

I am probably running into this problem too and cannot resume my experiment. Have you made any progress, or has this been improved in Ray 2.4?