ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] 'NoneType' object has no attribute 'hex' #42873

Closed: vedin-eta closed this issue 8 months ago

vedin-eta commented 9 months ago

What happened + What you expected to happen

We use Ray Tune for our model training on Ray Clusters that run on AWS spot instances. The error occurred in a long-running Ray Tune job after about 2 days of running, and it has happened only once so far. This setup had always worked properly before, so the issue is tough to reproduce even with the same config.

I checked another issue referencing the same error, but the fixes there didn't apply here: I had no issues with IP addresses.

Relevant logs from Ray Tune:

(automl_entry pid=5893) == Status ==
(automl_entry pid=5893) Current time: 2024-01-31 10:18:42 (running for 1 days, 17:49:28.17)
(automl_entry pid=5893) Using AsyncHyperBand: num_stopped=0
(automl_entry pid=5893) Bracket: Iter 5120.000: None | Iter 1280.000: None | Iter 320.000: None | Iter 80.000: None | Iter 20.000: None
(automl_entry pid=5893) Logical resource usage: 64.0/8 CPUs, 7.920000000000001/0 GPUs (0.0/1.0 cpu_entry_token)
(automl_entry pid=5893) Result logdir: /mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs
(automl_entry pid=5893) Number of trials: 101/infinite (8 RUNNING, 93 TERMINATED)
(automl_entry pid=5893) Number of errored trials: 34
(automl_entry pid=5893) +--------------+--------------+--------------+
(automl_entry pid=5893) | Trial name   | # failures   | error file   |
(automl_entry pid=5893) |--------------+--------------+--------------|
(automl_entry pid=5893) +--------------+--------------+--------------+
(automl_entry pid=5893) 
(_LoggingActor pid=9501) Dumping logs...
(call_normal pid=40878, ip=10.102.194.112) ====== [dummy model server return] chip_model_memory_dict: {'ram': {'used': 7812276.0, 'available': 5997620.0, 'activations': 159.89453125}, 'rom': {'used': 31999568.0, 'available': 38137396.0}} ======
(automl_entry pid=5893) Got trial: 303, Trial_ac144965
(automl_entry pid=5893) Finished processing trial: 303, Trial_ac144965
(_LoggingActor pid=9501) Stopping logger timer
Traceback (most recent call last):
  File "/tmp/ray/session_2024-01-29_14-25-59_628020_3300/runtime_resources/working_dir_files/_ray_pkg_f54f7dd5e076a8f4/automl_entry.py", line 77, in <module>
    args.dataset_dir = ray.get(ray_handle, timeout=timeout)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/_private/worker.py", line 2540, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::automl_entry() (pid=5893, ip=10.102.195.73)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 735, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 805, in _process_trial_result
    self._callbacks.on_trial_result(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/callback.py", line 392, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 857, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 811, in _sync_trial_dir
    sync_process.wait()
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 297, in wait
    raise exception
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 260, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 172, in _sync_dir_between_different_nodes
    num_cpus=0, **_force_on_node(source_node_id)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
    node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'

During handling of the above exception, another exception occurred:

ray::automl_entry() (pid=5893, ip=10.102.195.73)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tuner.py", line 347, in fit
    return self._local_tuner.fit()
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 712, in _fit_internal
    analysis = run(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tune.py", line 1070, in run
    runner.step()
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 256, in step
    if not self._actor_manager.next(timeout=0.1):
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 224, in next
    self._actor_task_events.resolve_future(future)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
    on_result(result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 752, in on_result
    self._actor_task_resolved(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 300, in _actor_task_resolved
    tracked_actor_task._on_result(tracked_actor, result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 824, in _on_result
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
    on_result(trial, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 735, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 805, in _process_trial_result
    self._callbacks.on_trial_result(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/callback.py", line 392, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 857, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 811, in _sync_trial_dir
    sync_process.wait()
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 297, in wait
    raise exception
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/syncer.py", line 260, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 172, in _sync_dir_between_different_nodes
    num_cpus=0, **_force_on_node(source_node_id)
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
    node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'

The above exception was the direct cause of the following exception:

ray::automl_entry() (pid=5893, ip=10.102.195.73)
  File "/tmp/ray/session_2024-01-29_14-25-59_628020_3300/runtime_resources/working_dir_files/_ray_pkg_f54f7dd5e076a8f4/automl_entry.py", line 29, in automl_entry
    return autoML_cluster_kickoff(working_dir, dataset_dir, oauth_token=oauth_token, fake_run=False)
  File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 426, in autoML_cluster_kickoff
    _autoML_condition(
  File "/home/ubuntu/lib/AutoMLTraining/automl/Utils/logger.py", line 165, in __exit__
    raise value
  File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 426, in autoML_cluster_kickoff
    _autoML_condition(
  File "/home/ubuntu/lib/AutoMLTraining/automl/autoML_cluster_kickoff.py", line 423, in _autoML_condition
    autoML_condition(*args, **kwargs)
  File "/home/ubuntu/lib/AutoMLTraining/automl/automl_main.py", line 357, in autoML_condition
    tuner.fit()
  File "/home/ubuntu/miniconda3/envs/automl/lib/python3.9/site-packages/ray/tune/tuner.py", line 349, in fit
    raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs", trainable=...)`.
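
For what it's worth, the last frame of the traceback shows the syncer passing node_id=None into NodeAffinitySchedulingStrategy, presumably because the source node had just been removed by the autoscaler. Below is a minimal sketch that reproduces only that final step, assuming Ray 2.5.0's behavior as shown in the traceback (an illustration, not our production code):

# Sketch only: reproduces the last frame of the traceback above. In Ray 2.5.0,
# NodeAffinitySchedulingStrategy converts a non-string node_id via node_id.hex(),
# so a None node_id (e.g. a node the autoscaler just removed) raises the same error.
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

try:
    NodeAffinitySchedulingStrategy(node_id=None, soft=False)
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'hex'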

Versions / Dependencies

Main dependency versions:

python3==3.9.18
ray==2.5.0
ubuntu==20.04.6 LTS

Other libraries:

absl-py==1.4.0
aiohttp==3.9.1
aiohttp-cors==0.7.0
aiosignal==1.3.1
albumentations==1.3.0
alembic==1.13.0
antlr4-python3-runtime==4.9.3
AptosConnector @ git+https://github.com/Eta-Compute/AptosConnector.git@7293b71ef2a2190f0e829450ee06e08264186af2
array-record==0.5.0
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.1.0
-e git+ssh://git@github.com/Eta-Compute/AutoMLTraining.git@3839752d9107a1600b7bd5b7c45c04e8679a3809#egg=AutoMLTraining
black==23.12.0
blessed==1.20.0
blosc2==2.5.1
boto3==1.28.31
botocore==1.31.85
botorch==0.7.3
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
click-shell==2.1
cloudpickle==3.0.0
cmaes==0.10.0
colorful==0.5.5
colorlog==6.8.0
cycler==0.12.1
Cython==3.0.7
dataclasses==0.6
detectron2 @ git+https://github.com/facebookresearch/detectron2.git@a0e22dbfa791e6235e4f196d5ce25e754d02be31
diffusers==0.24.0
dirhash==0.2.1
distlib==0.3.8
dm-tree==0.1.8
einops==0.7.0
etils==1.5.2
facexlib==0.3.0
fairscale==0.4.13
filelock==3.13.1
filterpy==1.4.5
fire==0.5.0
flatbuffers==23.5.26
fonttools==4.47.0
frozenlist==1.4.1
fsspec==2023.12.2
ftfy==6.1.3
future==0.18.3
fvcore==0.1.5.post20221221
gast==0.4.0
getpass-asterisk==1.0.1
google-api-core==2.15.0
google-auth==2.25.2
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
googleapis-common-protos==1.62.0
gpustat==1.1.1
GPUtil==1.4.0
gpytorch==1.9.0
greenlet==3.0.2
grpcio==1.51.3
h5py==3.10.0
huggingface-hub==0.20.0
hydra-core==1.3.2
idna==3.6
imageio==2.33.1
imagesize==1.4.1
imaginAIry==12.0.3
imantics==0.1.12
imgaug==0.4.0
importlib-metadata==7.0.0
importlib-resources==6.1.1
iopath==0.1.9
jmespath==1.0.1
joblib==1.3.2
jsonlines==4.0.0
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
jstyleson==0.0.2
keras==2.13.1
keras-flops==0.1.1
kiwisolver==1.4.5
kornia==0.7.0
labelme2coco==0.2.4
libclang==16.0.6
lightning-utilities==0.10.0
linear-operator==0.2.0
llvmlite==0.41.1
lxml==4.9.4
Mako==1.3.0
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib==3.5.2
matplotlib-inline==0.1.3
mean-average-precision @ git+https://github.com/bes-dev/mean_average_precision.git@9f4d65036de9c5deca7ec0c13c4d4278ecf8e3c3
msgpack==1.0.7
multidict==6.0.4
multipledispatch==1.0.0
mypy-extensions==1.0.0
ndindex==1.7
networkx==3.2.1
numba==0.58.1
numexpr==2.9.0
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-ml-py==12.535.133
nvidia-ml-py3==7.352.0
oauthlib==3.2.2
omegaconf==2.3.0
open-clip-torch==2.23.0
opencensus==0.11.3
opencensus-context==0.1.3
opencv-python==4.7.0.72
opencv-python-headless==4.8.1.78
opt-einsum==3.3.0
optuna==3.1.1
packaging==23.2
pandas==1.5.3
parse==1.20.0
pathspec==0.12.1
patsy==0.5.4
pexpect==4.9.0
Pillow==9.5.0
pixellib==0.7.1
platformdirs==3.11.0
plotly==5.18.0
portalocker==2.8.2
prometheus-client==0.19.0
promise==2.3
protobuf==3.20.3
psutil==5.9.7
ptyprocess==0.7.0
py-cpuinfo==9.0.0
py-spy==0.3.14
pyarrow==14.0.2
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybboxes==0.1.6
pycocotools==2.0.7
pydantic==1.10.13
pydot==1.4.2
pyparsing==3.1.1
PyQt5==5.15.10
PyQt5-Qt5==5.15.2
PyQt5-sip==12.13.0
pyre-extensions==0.0.23
pyro-api==0.1.2
pyro-ppl==1.8.6
python-dateutil==2.8.2
pytorch-lightning==1.9.5
pytz==2023.3.post1
PyWavelets==1.5.0
PyYAML==6.0.1
qudida==0.0.4
ray==2.5.0
referencing==0.32.0
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rpds-py==0.15.2
rsa==4.9
s3transfer==0.6.2
safetensors==0.4.1
sahi==0.11.15
scandir==1.10.0
scantree==0.0.2
scikit-image==0.19.3
scikit-learn==1.2.1
scipy==1.11.4
sentencepiece==0.1.99
shapely==2.0.2
simplejson==3.19.2
six==1.16.0
smart-open==6.4.0
SQLAlchemy==2.0.23
statsmodels==0.13.5
tables==3.9.2
tabulate==0.9.0
tenacity==8.2.3
tensorboard==2.13.0
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
tensorflow==2.13.1
tensorflow-datasets==4.9.3
tensorflow-estimator==2.13.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-metadata==1.14.0
termcolor==2.4.0
terminaltables==3.1.10
tflite==2.10.0
threadpoolctl==3.2.0
tifffile==2023.12.9
timm==0.9.12
tokenizers==0.15.0
toml==0.10.2
tomli==2.0.1
torch==1.13.1
torchdiffeq==0.2.3
torcheval==0.0.7
torchinfo==1.8.0
torchmetrics==1.2.1
torchvision==0.14.1
tqdm==4.66.1
traitlets==5.14.0
transformers==4.36.2
typing-inspect==0.9.0
typing_extensions==4.5.0
tzdata==2023.3
ujson==5.9.0
urllib3==1.26.18
virtualenv==20.21.0
wcwidth==0.2.12
Werkzeug==3.0.1
wrapt==1.16.0
xformers==0.0.16
xmljson==0.2.1
xtcocotools==1.14.3
yacs==0.1.8
yarl==1.9.4
zipp==3.17.0

Reproduction script

I'm struggling to reproduce the issue myself. I'm not looking for extensive coding help, just for advice on whether there's something in our current setup we should change to prevent the bug from recurring.

However, I will share some more specifics of what happened before the job crashed:

  1. The autoscaler removed a node: (autoscaler +43h35m31s) Removing 1 nodes of type ray.worker.gpu_8cpu_1gpu_g5 (idle). (autoscaler +43h35m41s) Resized to 64 CPUs, 7 GPUs.
  2. Another node was quickly added by the autoscaler: (autoscaler +43h35m56s) Resized to 72 CPUs, 8 GPUs.
  3. Since the previous worker crashed (probably a spot instance reclaimed by AWS), we load from the last checkpoint:
    (call_normal pid=40878, ip=10.102.194.112) Loading from last checkpoint...
    (call_normal pid=40878, ip=10.102.194.112) Loaded model weights from /mnt/efs-data/jobs/20240129_141911_4eafbafa-4cfa-467f-bb8f-36302367e2a5/results/Raytune_logs/Trial_ac144965_67_anchor_aspect_ratios=0_5_1_0_2_0,anchor_grid_split_xy=1,anchor_matching_iou=0.3500,anchor_scale_min_max=0_1_0_9,_2024-01-30_17-06-39/checkpoint_tmp2a67c5
    (call_normal pid=40878, ip=10.102.194.112) Loading from last epoch: 66 with learning rate: 0.00010940543143078685
  4. We download a model using Keras: (call_normal pid=3542, ip=10.102.194.97) Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet/mobilenet_5_0_224_tf.h5
  5. We rsync the dataset from the head node to the worker node: (call_normal pid=3542, ip=10.102.194.97) rsyncing datasetcoco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890from remote: from ubuntu@10.102.195.73:/opt/dlami/nvme/tfds-cache/coco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890 to /opt/dlami/nvme/tfds-cache/coco_2017f5146dea0cb6570065a6b46ab137139a9dd74aecfc7550b623d3fd6caea44890

Since the error logs mention syncing between different nodes, do you think I should disable that syncing, given that we rsync the dataset manually and the trial dir lives on a shared network drive (AWS EFS)? A sketch of what I mean is below. Thanks in advance for any hints; I'll keep trying to reproduce the error in the meantime.
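
For clarity, here is a minimal sketch of what I mean by disabling the trial-dir sync, assuming Ray 2.5's SyncConfig API (the trainable and paths are placeholders, not our actual code):

# Sketch only (placeholder trainable and paths). Since the results dir already
# lives on shared EFS, my understanding is that SyncConfig(syncer=None) turns off
# the head<->worker trial-dir syncing in Ray 2.5 and relies on the shared
# filesystem instead.
from ray import air, tune

def my_trainable(config):  # placeholder trainable
    return {"score": 0}

tuner = tune.Tuner(
    my_trainable,
    run_config=air.RunConfig(
        local_dir="/mnt/efs-data/jobs/<job_id>/results",  # shared EFS mount
        sync_config=tune.SyncConfig(syncer=None),  # no node-to-node syncing
    ),
)
tuner.fit()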

Issue Severity

Medium: It is a significant difficulty but I can work around it.

matthewdeng commented 9 months ago

@vedin-eta would it be possible for you to update to a newer version of Ray (e.g. 2.9)? The syncing logic has since been revamped significantly and should no longer rely on the previous node IP. cc @justinvyu