What happened + What you expected to happen

The issue

For ray==2.20.0, kuberay==1.1.0, when running the LightGBM Ray Tune tutorial (https://docs.ray.io/en/latest/tune/examples/lightgbm_example.html) on a multi-node cluster, the run does not complete because the cluster storage cannot be set up:
(TunerInternal pid=1090302) Trial task failed for trial train_breast_cancer_a636f_00003
(TunerInternal pid=1090302) Traceback (most recent call last):
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
(TunerInternal pid=1090302) result = ray.get(future)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
(TunerInternal pid=1090302) return fn(*args, **kwargs)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(TunerInternal pid=1090302) return func(*args, **kwargs)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
(TunerInternal pid=1090302) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
(TunerInternal pid=1090302) raise value.as_instanceof_cause()
(TunerInternal pid=1090302) ray.exceptions.RayTaskError(RuntimeError): ray::ImplicitFunc.train() (pid=27628, ip=100.72.122.209, actor_id=815ce62aee0f4e028b4e71ca6e000000, repr=train_breast_cancer)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 331, in train
(TunerInternal pid=1090302) raise skipped from exception_cause(skipped)
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/air/_internal/util.py", line 98, in run
(TunerInternal pid=1090302) self._ret = self._target(*self._args, **self._kwargs)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
(TunerInternal pid=1090302) training_func=lambda: self._trainable_func(self.config),
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
(TunerInternal pid=1090302) output = fn()
(TunerInternal pid=1090302) ^^^^
(TunerInternal pid=1090302) File "/tmp/ipykernel_4006599/981054078.py", line 18, in train_breast_cancer
(TunerInternal pid=1090302) File "/tmp/ray/session_2024-06-03_09-10-38_778444_1/runtime_resources/pip/e03ea1ab0b9620575b4404e353f290db608a5520/virtualenv/lib/python3.11/site-packages/lightgbm/engine.py", line 286, in train
(TunerInternal pid=1090302) cb(callback.CallbackEnv(model=booster,
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/lightgbm/_lightgbm_utils.py", line 164, in __call__
(TunerInternal pid=1090302) train.report(report_dict, checkpoint=checkpoint)
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 657, in wrapper
(TunerInternal pid=1090302) return fn(*args, **kwargs)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 748, in report
(TunerInternal pid=1090302) _get_session().report(metrics, checkpoint=checkpoint)
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 426, in report
(TunerInternal pid=1090302) persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
(TunerInternal pid=1090302) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 539, in persist_current_checkpoint
(TunerInternal pid=1090302) self._check_validation_file()
(TunerInternal pid=1090302) File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 490, in _check_validation_file
(TunerInternal pid=1090302) raise RuntimeError(
(TunerInternal pid=1090302) RuntimeError: Unable to set up cluster storage with the following settings:
(TunerInternal pid=1090302) StorageContext<
(TunerInternal pid=1090302) storage_filesystem='local',
(TunerInternal pid=1090302) storage_fs_path='/home/asu/ray_results',
(TunerInternal pid=1090302) experiment_dir_name='train_breast_cancer_2024-06-06_07-09-00',
(TunerInternal pid=1090302) trial_dir_name='train_breast_cancer_a636f_00003_3_boosting_type=dart,learning_rate=0.0057,num_leaves=88_2024-06-06_07-09-01',
(TunerInternal pid=1090302) current_checkpoint_index=0,
(TunerInternal pid=1090302) >
(TunerInternal pid=1090302) Check that all nodes in the cluster have read/write access to the configured storage path. `RunConfig(storage_path)` should be set to a cloud storage URI or a shared filesystem path accessible by all nodes in your cluster ('s3://bucket' or '/mnt/nfs'). A local path on the head node is not accessible by worker nodes. See: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html
Expected output
However, if I follow a simpler example (akin to https://docs.ray.io/en/latest/tune/examples/optuna_example.html) where the objective function explicitly uses `ray.train.report` to report back after each training iteration, rather than a callback, this issue doesn't occur.
Furthermore, on a cluster with older versions (ray==2.9.3, kuberay==1.0.0), the LightGBM example above works fine.
Thus, with the newer versions there must be an issue with the way `ray.tune.integration.lightgbm.TuneReportCheckpointCallback` saves to disk compared with `ray.train.report`, though I'm aware the former ultimately uses the latter under the hood.
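For reference, a minimal sketch of the pattern that works, loosely modeled on the Optuna example (`evaluate` is a placeholder for any per-step scoring function):

```python
from ray import train


def objective(config):
    for step in range(config["steps"]):
        score = evaluate(step, config["width"], config["height"])  # placeholder scoring fn
        # Metrics only, no checkpoint attached; presumably this is why the
        # storage validation in persist_current_checkpoint is never triggered.
        train.report({"iterations": step, "mean_loss": score})
```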
Attempted solutions
First I tried providing an alternative (local) path via `run_config=train.RunConfig(storage_path='/tmp/ray_results')`, but this produces the same error.
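A sketch of that attempt (assuming the stock `train_breast_cancer` trainable and `config` search space from the tutorial):

```python
from ray import train, tune

tuner = tune.Tuner(
    train_breast_cancer,
    tune_config=tune.TuneConfig(metric="binary_error", mode="min", num_samples=4),
    param_space=config,
    # Point trials at a node-local directory instead of the default ~/ray_results.
    run_config=train.RunConfig(storage_path="/tmp/ray_results"),
)
results = tuner.fit()  # fails with the same RuntimeError on the multi-node cluster
```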
Next I provided a storage location in an S3 bucket. However, I find that for LightGBM, given the high number of boosting rounds, number of trials, etc., this is prohibitively slow, as the workers need to write to this external storage very frequently.
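The corresponding change for the S3 attempt (the bucket name is a placeholder):

```python
# Shared storage reachable from all nodes, as the error message suggests.
run_config = train.RunConfig(storage_path="s3://my-bucket/ray_results")
```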
A third alternative, not fully explored yet, is to perform something akin to a `partial_fit` with LightGBM, so that one can exit the fit after each "training_iteration" and manually report the results using `ray.train.report`, as in the simpler example outlined above:
"""
pseudocode, inspired by lightGBM example
"""
estimator = None
for i in num_boosting_rounds:
estimator = lgb.train(..., num_boost_round=1, init_model=estimator)
ray.train.report({"training_iteration": i, "binary_error": sklearn.metrics.accuracy_score(estimator.predict(test_x), test_y)})
However, this feels like an unnatural way to do things, and it does not really resolve the underlying issue.
Versions / Dependencies

Failing setup: ray==2.20.0, kuberay==1.1.0, Python 3.11.
Working setup: ray==2.9.3, kuberay==1.0.0.
Reproduction script
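A sketch of the failing setup, which is essentially the linked LightGBM tutorial (metric mapping and search space abbreviated; exact values approximate):

```python
import lightgbm as lgb
import sklearn.datasets
from sklearn.model_selection import train_test_split
from ray import tune
from ray.tune.integration.lightgbm import TuneReportCheckpointCallback


def train_breast_cancer(config):
    data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
    train_set = lgb.Dataset(train_x, label=train_y)
    test_set = lgb.Dataset(test_x, label=test_y)
    lgb.train(
        config,
        train_set,
        valid_sets=[test_set],
        valid_names=["eval"],
        # Reports metrics and persists checkpoints during training; the
        # StorageContext validation fails inside this callback (see traceback above).
        callbacks=[TuneReportCheckpointCallback({"binary_error": "eval-binary_error"})],
    )


config = {
    "objective": "binary",
    "metric": ["binary_error", "binary_logloss"],
    "boosting_type": tune.choice(["gbdt", "dart"]),
    "num_leaves": tune.randint(10, 1000),
    "learning_rate": tune.loguniform(1e-8, 1e-1),
}

tuner = tune.Tuner(
    train_breast_cancer,
    tune_config=tune.TuneConfig(metric="binary_error", mode="min", num_samples=4),
    param_space=config,
)
results = tuner.fit()
```

This is run on a multi-node KubeRay cluster without setting `RunConfig(storage_path)`, so the default `~/ray_results` on the head node is used.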
Issue Severity
Medium: It is a significant difficulty but I can work around it.