What happened + What you expected to happen
We use Ray Tune for model training on Ray clusters running on AWS spot instances. The error occurred in a long-running Ray Tune job on a cluster (after about 2 days of running) and has only happened once so far. Everything in that regard used to work properly, so it's tough to reproduce even with the same config.
I checked another issue referencing the same error, but the fixes didn't apply to my case: I had no issues with IP addresses.
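For context, our Tune setup looks roughly like the sketch below (heavily simplified; the training function, metric name, and resource values are placeholders rather than our actual code):

from ray import air, tune
from ray.air import session
from ray.tune.schedulers import AsyncHyperBandScheduler

def train_fn(config):
    # Placeholder for our real training loop; it just reports a dummy metric.
    for epoch in range(100):
        session.report({"mean_loss": 1.0 / (epoch + 1)})

tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 8, "gpu": 1}),  # placeholder per-trial resources
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        scheduler=AsyncHyperBandScheduler(),  # shows up as "Using AsyncHyperBand" in the status output
        num_samples=-1,                       # keep starting trials indefinitely ("Number of trials: .../infinite")
    ),
    run_config=air.RunConfig(
        name="Raytune_logs",
        # Results live on a shared EFS mount ("storage_path" on newer Ray, "local_dir" on older versions).
        storage_path="/mnt/efs-data/jobs/<job_id>/results",
    ),
)
results = tuner.fit()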
Relevant logs from Ray Tune:
(automl_entry pid=5893) == Status ==
(automl_entry pid=5893) Current time: 2024-01-31 10:18:42 (running for 1 days, 17:49:28.17)
(automl_entry pid=5893) Using AsyncHyperBand: num_stopped=0
(automl_entry pid=5893) Bracket: Iter 5120.000: None | Iter 1280.000: None | Iter 320.000: None | Iter 80.000: None | Iter 20.000: None
(automl_entry pid=5893) Logical resource usage: 64.0/8 CPUs, 7.920000000000001/0 GPUs (0.0/1.0 cpu_entry_token)
(automl_entry pid=5893) Result logdir: /mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs
(automl_entry pid=5893) Number of trials: 101/infinite (8 RUNNING, 93 TERMINATED)
(automl_entry pid=5893) Number of errored trials: 34
(automl_entry pid=5893) +--------------+--------------+--------------+
(automl_entry pid=5893) | Trial name | # failures | error file |
(automl_entry pid=5893) |--------------+--------------+--------------|
(automl_entry pid=5893) +--------------+--------------+--------------+
(automl_entry pid=5893)
(_LoggingActor pid=9501) Dumping logs...
(call_normal pid=40878, ip=10.102.194.112) ====== [dummy model server return] chip_model_memory_dict: {'ram': {'used': 7812276.0, 'available': 5997620.0, 'activations': 159.89453125}, 'rom': {'used': 31999568.0, 'available': 38137396.0}} ======
(automl_entry pid=5893) Got trial: 303, Trial_ac144965
(automl_entry pid=5893) Finished processing trial: 303, Trial_ac144965
(_LoggingActor pid=9501) Stopping logger timer
Traceback (most recent call last):
File "/tmp/ray/session_2024-01-29_14-25-59_628020_3300/runtime_resources/working_dir_files/_ray_pkg_f54f7dd5e076a8f4/automl_entry.py", line 77, in <module>
args.dataset_dir = ray.get(ray_handle, timeout=timeout)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/worker.py", line 2540, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::automl_entry() (pid=5893, ip=10.102.195.73)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 735, in _on_training_result
self._process_trial_results(trial, result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 805, in _process_trial_result
self._callbacks.on_trial_result(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/callback.py", line 392, in on_trial_result
callback.on_trial_result(**info)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 857, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 811, in _sync_trial_dir
sync_process.wait()
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 297, in wait
raise exception
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 260, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 172, in _sync_dir_between_different_nodes
num_cpus=0, **_force_on_node(source_node_id)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'
During handling of the above exception, another exception occurred:
ray::automl_entry() (pid=5893, ip=10.102.195.73)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tuner.py", line 347, in fit
return self._local_tuner.fit()
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 712, in _fit_internal
analysis = run(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tune.py", line 1070, in run
runner.step()
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 256, in step
if not self._actor_manager.next(timeout=0.1):
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 224, in next
self._actor_task_events.resolve_future(future)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
on_result(result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 752, in on_result
self._actor_task_resolved(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 300, in _actor_task_resolved
tracked_actor_task._on_result(tracked_actor, result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 824, in _on_result
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 735, in _on_training_result
self._process_trial_results(trial, result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 805, in _process_trial_result
self._callbacks.on_trial_result(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/callback.py", line 392, in on_trial_result
callback.on_trial_result(**info)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 857, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 811, in _sync_trial_dir
sync_process.wait()
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 297, in wait
raise exception
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 260, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 172, in _sync_dir_between_different_nodes
num_cpus=0, **_force_on_node(source_node_id)
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'
The above exception was the direct cause of the following exception:
ray::automl_entry() (pid=5893, ip=10.102.195.73)
File "/tmp/ray/session_2024-01-29_14-25-59_628020_3300/runtime_resources/working_dir_files/_ray_pkg_f54f7dd5e076a8f4/automl_entry.py", line 29, in automl_entry
return autoML_cluster_kickoff(working_dir, dataset_dir, oauth_token=oauth_token, fake_run=False)
File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 426, in autoML_cluster_kickoff
_autoML_condition(
File "/home/ubuntu/lib/AutoMLTraining/automl/Utils/logger.py", line 165, in __exit__
raise value
File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 426, in autoML_cluster_kickoff
_autoML_condition(
File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 423, in _autoML_condition
autoML_condition(*args, **kwargs)
File "/home/ubuntu/lib/AutoMLTraining/automl/automl_main.py", line 357, in autoML_condition
tuner.fit()
File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tuner.py", line 349, in fit
raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs", trainable=...)`.
Reproduction script
I'm struggling to reproduce the issue myself. I'm not looking for extensive coding help, just for some advice on whether there's anything we're currently doing that we should stop doing to prevent the bug from occurring again.
However, I will share some more specifics of what happened before the job crashed:
The autoscaler removed a node:
(autoscaler +43h35m31s) Removing 1 nodes of type ray.worker.gpu_8cpu_1gpu_g5 (idle).
(autoscaler +43h35m41s) Resized to 64 CPUs, 7 GPUs.
Another node was quickly added by the autoscaler:
(autoscaler +43h35m56s) Resized to 72 CPUs, 8 GPUs.
Since the previous worker crashed (probably a spot instance reclaimed by AWS), we load from the last checkpoint:
(call_normal pid=40878, ip=10.102.194.112) Loading from last checkpoint...
(call_normal pid=40878, ip=10.102.194.112) Loaded model weights from /mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs/Trial_ac144965_67_anchor_aspect_ratios=0_5_1_0_2_0,anchor_grid_split_xy=1,anchor_matching_iou=0.3500,anchor_scale_min_max=0_1_0_9,_2024-01-30_17-06-39/checkpoint_tmp2a67c5
(call_normal pid=40878, ip=10.102.194.112) Loading from last epoch: 66 with learning rate: 0.00010940543143078685
We download a model using Keras:
(call_normal pid=3542, ip=10.102.194.97) Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet/mobilenet_5_0_224_tf.h5
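(For reference, I believe that weights file corresponds to a Keras call roughly like the one below; the exact arguments in our code may differ.)

from tensorflow.keras.applications import MobileNet

# mobilenet_5_0_224_tf.h5 is the weights file Keras downloads for MobileNet
# with alpha=0.5, 224x224 input, and include_top=True.
model = MobileNet(input_shape=(224, 224, 3), alpha=0.5, weights="imagenet")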
We rsync the dataset from the head node to the worker node:
(call_normal pid=3542, ip=10.102.194.97) rsyncing dataset coco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890 from remote: from ubuntu@10.102.195.73:/opt/dlami/nvme/tfds-cache/coco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890 to /opt/dlami/nvme/tfds-cache/coco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890
Since the error comes from Ray Tune syncing the trial directory between nodes (the _force_on_node call ends up with a None source node ID, presumably because it still references the node that was just removed), do you think I should somehow disable this syncing? We already rsync the dataset manually, and the trial dir lives on a shared network drive (AWS EFS).
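The kind of change I had in mind is roughly the following (a sketch assuming the Ray 2.x tune.SyncConfig API; the parameter names may differ in our exact version and I haven't tried this yet):

from ray import air, tune

# Sketch: with the results directory on shared storage (EFS), node-to-node
# syncing of trial dirs could presumably be turned off by passing syncer=None.
sync_config = tune.SyncConfig(syncer=None)

tuner = tune.Tuner(
    train_fn,  # same placeholder training function as in the sketch above
    run_config=air.RunConfig(
        storage_path="/mnt/efs-data/jobs/<job_id>/results",  # shared EFS mount
        sync_config=sync_config,
    ),
)

Thanks in advance for any hints, and I'll try to reproduce the error soon.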
Issue Severity
Medium: It is a significant difficulty but I can work around it.
@vedin-eta would it be possible for you to update to a newer version of Ray (e.g. 2.9)? The syncing logic has since been revamped significantly and should no longer rely on the previous IP. cc @justinvyu