What happened + What you expected to happen
I'm currently tuning a model with Ray Tune and Optuna, but the cluster I'm working on has issues and sometimes runs out of memory, which makes all my current and pending runs fail. I therefore added failure_config=air.FailureConfig(fail_fast=True) so that the first OOM stops the experiment and I can resume it later. However, when the experiment stops, all current and pending runs are marked as terminated, so restoring the Tuner does not resume these runs.
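For reference, the relevant part of the setup is just the failure config on the RunConfig (a minimal excerpt of the reproduction script at the end of this issue):

import ray.air as air
import ray.tune as tune

tuner = tune.Tuner(
    tune.with_resources(train_dummy, resources={"cpu": 12, "gpu": 1}),
    tune_config=tune.TuneConfig(num_samples=10, metric="val", mode="min"),
    run_config=air.RunConfig(
        local_dir="test-restore",
        # Stop the whole experiment as soon as one trial fails (e.g. on the first OOM)
        failure_config=air.FailureConfig(fail_fast=True),
    ),
    param_space={"val": tune.uniform(10, 20)},
)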
Here is the end of the output I get with the example script:
== Status ==
Current time: 2022-11-22 15:02:29 (running for 00:00:58.16)
Memory usage on this node: 9.4/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 48.0/48 CPUs, 4.0/8 GPUs, 0.0/339.25 GiB heap, 0.0/149.38 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (6 PENDING, 4 RUNNING)
+-------------------------+----------+------------------+---------+--------+------------------+---------+
| Trial name | status | loc | val | iter | total time (s) | val |
|-------------------------+----------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00000 | RUNNING | 10.0.7.31:881285 | 19.0407 | 5 | 56.2074 | 3.80815 |
| train_dummy_2a005_00001 | RUNNING | 10.0.7.31:881510 | 18.9348 | 4 | 52.2111 | 4.7337 |
| train_dummy_2a005_00002 | RUNNING | 10.0.7.31:881512 | 18.4179 | 4 | 49.2225 | 4.60448 |
| train_dummy_2a005_00003 | RUNNING | 10.0.7.31:881514 | 10.1868 | 6 | 48.223 | 1.69781 |
| train_dummy_2a005_00004 | PENDING | | 10.3313 | | | |
| train_dummy_2a005_00005 | PENDING | | 12.4223 | | | |
| train_dummy_2a005_00006 | PENDING | | 16.8942 | | | |
| train_dummy_2a005_00007 | PENDING | | 19.8347 | | | |
| train_dummy_2a005_00008 | PENDING | | 17.7972 | | | |
| train_dummy_2a005_00009 | PENDING | | 10.8272 | | | |
+-------------------------+----------+------------------+---------+--------+------------------+---------+
2022-11-22 15:02:29,330 ERROR trial_runner.py:993 -- Trial train_dummy_2a005_00000: Error processing event.
ray.exceptions.RayTaskError(IndexError): ray::ImplicitFunc.train() (pid=881285, ip=10.0.7.31, repr=train_dummy)
File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 355, in train
raise skipped from exception_cause(skipped)
File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 325, in entrypoint
return self._trainable_func(
File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 651, in _trainable_func
output = fn()
File "/home/alain/code/teacher-student/test_restore.py", line 12, in train_dummy
raise IndexError("Dummy error for the example")
IndexError: Dummy error for the example
Result for train_dummy_2a005_00000:
date: 2022-11-22_15-02-29
done: false
experiment_id: 8f09245de949441fad34a02ba982eb6b
experiment_tag: 0_val=19.0407
hostname: bach
iterations_since_restore: 5
node_ip: 10.0.7.31
pid: 881285
time_since_restore: 56.207355260849
time_this_iter_s: 10.021783113479614
time_total_s: 56.207355260849
timestamp: 1669125749
timesteps_since_restore: 0
training_iteration: 5
trial_id: 2a005_00000
val: 3.8081459750683044
warmup_time: 0.0031404495239257812
== Status ==
Current time: 2022-11-22 15:02:29 (running for 00:00:58.37)
Memory usage on this node: 9.3/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/8 GPUs, 0.0/339.25 GiB heap, 0.0/149.38 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (1 ERROR, 9 TERMINATED)
+-------------------------+------------+------------------+---------+--------+------------------+---------+
| Trial name | status | loc | val | iter | total time (s) | val |
|-------------------------+------------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00001 | TERMINATED | 10.0.7.31:881510 | 18.9348 | 4 | 52.2111 | 4.7337 |
| train_dummy_2a005_00002 | TERMINATED | 10.0.7.31:881512 | 18.4179 | 4 | 49.2225 | 4.60448 |
| train_dummy_2a005_00003 | TERMINATED | 10.0.7.31:881514 | 10.1868 | 6 | 48.223 | 1.69781 |
| train_dummy_2a005_00004 | TERMINATED | | 10.3313 | | | |
| train_dummy_2a005_00005 | TERMINATED | | 12.4223 | | | |
| train_dummy_2a005_00006 | TERMINATED | | 16.8942 | | | |
| train_dummy_2a005_00007 | TERMINATED | | 19.8347 | | | |
| train_dummy_2a005_00008 | TERMINATED | | 17.7972 | | | |
| train_dummy_2a005_00009 | TERMINATED | | 10.8272 | | | |
| train_dummy_2a005_00000 | ERROR | 10.0.7.31:881285 | 19.0407 | 5 | 56.2074 | 3.80815 |
+-------------------------+------------+------------------+---------+--------+------------------+---------+
Number of errored trials: 1
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------|
| train_dummy_2a005_00000 | 1 | /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00000_0_val=19.0407_2022-11-22_15-01-31/error.txt |
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
(train_dummy pid=881510) 2022-11-22 15:02:29,380 ERROR worker.py:763 -- Worker exits with an exit code 1.
(train_dummy pid=881510) Traceback (most recent call last):
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 1032, in ray._raylet.task_execution_handler
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 812, in ray._raylet.execute_task
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 852, in ray._raylet.execute_task
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(train_dummy pid=881510) File "python/ray/_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
(train_dummy pid=881510) return method(__ray_actor, *args, **kwargs)
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(train_dummy pid=881510) return method(self, *_args, **_kwargs)
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 352, in train
(train_dummy pid=881510) result = self.step()
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(train_dummy pid=881510) return method(self, *_args, **_kwargs)
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 365, in step
(train_dummy pid=881510) result = self._results_queue.get(
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/queue.py", line 180, in get
(train_dummy pid=881510) self.not_empty.wait(remaining)
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/threading.py", line 324, in wait
(train_dummy pid=881510) gotit = waiter.acquire(True, timeout)
(train_dummy pid=881510) File "/home/alain/miniconda3/envs/evaltmp/lib/python3.10/site-packages/ray/_private/worker.py", line 760, in sigterm_handler
(train_dummy pid=881510) sys.exit(1)
(train_dummy pid=881510) SystemExit: 1
2022-11-22 15:02:29,567 ERROR tune.py:773 -- Trials did not complete: [train_dummy_2a005_00000]
2022-11-22 15:02:29,567 INFO tune.py:777 -- Total run time: 59.38 seconds (58.36 seconds for the tuning loop).
Best trial:
Result(metrics={'val': 1.6978080140891973, 'done': False, 'trial_id': '2a005_00003', 'experiment_tag': '3_val=10.1868'}, error=None, log_dir=PosixPath('/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00003_3_val=10.1868_2022-11-22_15-01-32'))
Best trial config:
{'val': 10.186848084535184}
Then, here is what I get when trying to restore the Tuner:
2022-11-22 15:03:05,918 INFO worker.py:1528 -- Started a local Ray instance.
2022-11-22 15:03:07,547 WARNING function_trainable.py:586 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2022-11-22 15:03:08,351 INFO trial_runner.py:601 -- A local experiment checkpoint was found and will be used to restore the previous experiment state.
2022-11-22 15:03:08,352 INFO trial_runner.py:738 -- Using following checkpoint to resume: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/experiment_state-2022-11-22_15-01-30.json
2022-11-22 15:03:08,352 WARNING trial_runner.py:743 -- Attempting to resume experiment from /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26. This will ignore any new changes to the specification.
2022-11-22 15:03:08,375 INFO tune.py:668 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
== Status ==
Current time: 2022-11-22 15:03:08 (running for 00:00:00.01)
Memory usage on this node: 9.0/503.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/8 GPUs, 0.0/341.66 GiB heap, 0.0/150.42 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 2a005_00003 with val=1.6978080140891973 and parameters={'val': 10.186848084535184}
Result logdir: /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26
Number of trials: 10/10 (1 ERROR, 9 TERMINATED)
+-------------------------+------------+------------------+---------+--------+------------------+---------+
| Trial name | status | loc | val | iter | total time (s) | val |
|-------------------------+------------+------------------+---------+--------+------------------+---------|
| train_dummy_2a005_00001 | TERMINATED | 10.0.7.31:881510 | 18.9348 | 4 | 52.2111 | 4.7337 |
| train_dummy_2a005_00002 | TERMINATED | 10.0.7.31:881512 | 18.4179 | 4 | 49.2225 | 4.60448 |
| train_dummy_2a005_00003 | TERMINATED | 10.0.7.31:881514 | 10.1868 | 6 | 48.223 | 1.69781 |
| train_dummy_2a005_00004 | TERMINATED | | 10.3313 | | | |
| train_dummy_2a005_00008 | TERMINATED | | 17.7972 | | | |
| train_dummy_2a005_00007 | TERMINATED | | 19.8347 | | | |
| train_dummy_2a005_00005 | TERMINATED | | 12.4223 | | | |
| train_dummy_2a005_00006 | TERMINATED | | 16.8942 | | | |
| train_dummy_2a005_00009 | TERMINATED | | 10.8272 | | | |
| train_dummy_2a005_00000 | ERROR | 10.0.7.31:881285 | 19.0407 | 5 | 56.2074 | 3.80815 |
+-------------------------+------------+------------------+---------+--------+------------------+---------+
Number of errored trials: 1
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------|
| train_dummy_2a005_00000 | 1 | /home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00000_0_val=19.0407_2022-11-22_15-01-31/error.txt |
+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------+
2022-11-22 15:03:08,495 ERROR tune.py:773 -- Trials did not complete: [train_dummy_2a005_00000]
2022-11-22 15:03:08,495 INFO tune.py:777 -- Total run time: 0.95 seconds (0.00 seconds for the tuning loop).
Best trial:
Result(metrics={'val': 1.6978080140891973, 'done': False, 'trial_id': '2a005_00003', 'experiment_tag': '3_val=10.1868'}, error=None, log_dir=PosixPath('/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26/train_dummy_2a005_00003_3_val=10.1868_2022-11-22_15-01-32'))
Best trial config:
{'val': 10.186848084535184}
I think it would make more sense to let the user resume the experiment in this situation. Otherwise, is there a way to use the results from an interrupted experiment in a new one, e.g. as prior knowledge for a hyperparameter search?
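One workaround I'm considering is to load the results of the interrupted experiment and feed them to OptunaSearch as prior observations via points_to_evaluate / evaluated_rewards. A minimal sketch, assuming ExperimentAnalysis accepts the old experiment directory and that the last reported "val" of each trial is a reasonable reward to feed back:

import ray.air as air
import ray.tune as tune
from ray.tune import ExperimentAnalysis
from ray.tune.search.optuna import OptunaSearch

# Load the last reported result of every trial from the interrupted experiment.
analysis = ExperimentAnalysis(
    "/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26"
)
df = analysis.dataframe(metric="val", mode="min")
df = df[df["val"].notna()]  # keep only trials that reported the metric at least once

# Seed Optuna with the old (config, metric) pairs as already-evaluated points.
searcher = OptunaSearch(
    metric="val",
    mode="min",
    points_to_evaluate=[{"val": v} for v in df["config/val"]],
    evaluated_rewards=list(df["val"]),
)

tuner = tune.Tuner(
    tune.with_resources(train_dummy, resources={"cpu": 12, "gpu": 1}),
    # metric/mode are set on the searcher, so they are not repeated here
    tune_config=tune.TuneConfig(search_alg=searcher, num_samples=10),
    run_config=air.RunConfig(local_dir="test-restore"),
    param_space={"val": tune.uniform(10, 20)},
)
results = tuner.fit()

But I'm not sure this is the intended way to carry results over to a new experiment.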
Versions / Dependencies
Python 3.10
Ray 2.1.0
Ubuntu 22.04
Reproduction script
import random
import time

import ray.air as air
import ray.tune as tune


def train_dummy(config):
    v = config["val"]
    for i in range(1, 30):
        if random.random() < 0.05:
            raise IndexError("Dummy error for the example")
        time.sleep(random.randint(5, 15))
        tune.report(val=v / i)


def train_dummy_models():
    config = {'val': tune.uniform(10, 20)}
    tuner = tune.Tuner(
        tune.with_resources(
            train_dummy,
            resources={"cpu": 12, "gpu": 1}
        ),
        tune_config=tune.TuneConfig(
            num_samples=10,
            metric="val",
            mode="min"
        ),
        run_config=air.RunConfig(
            local_dir="test-restore",
            failure_config=air.FailureConfig(fail_fast=True)
        ),
        param_space=config
    )
    # Comment out the line below the first time you run the script; when restoring,
    # replace the path with the experiment directory created by that first run.
    tuner = tune.Tuner.restore(path="/home/alain/code/teacher-student/test-restore/train_dummy_2022-11-22_15-01-26")
    results = tuner.fit()
    best_trial = results.get_best_result()
    print("Best trial:")
    print(best_trial)
    print("Best trial config:")
    print(best_trial.config)


if __name__ == "__main__":
    train_dummy_models()
Issue Severity
Medium: It is a significant difficulty but I can work around it.