
[Tune] lightGBM callback cannot write locally during cluster run #45773

Open SuperKam91 opened 4 weeks ago

SuperKam91 commented 4 weeks ago

What happened + What you expected to happen

The issue

With ray==2.20.0 and kuberay==1.1.0, when running the LightGBM ray.tune tutorial (https://docs.ray.io/en/latest/tune/examples/lightgbm_example.html) on a multi-node cluster, the run does not complete because the cluster storage cannot be set up:

(TunerInternal pid=1090302) Trial task failed for trial train_breast_cancer_a636f_00003
(TunerInternal pid=1090302) Traceback (most recent call last):
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
(TunerInternal pid=1090302)     result = ray.get(future)
(TunerInternal pid=1090302)              ^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
(TunerInternal pid=1090302)     return fn(*args, **kwargs)
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(TunerInternal pid=1090302)     return func(*args, **kwargs)
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
(TunerInternal pid=1090302)     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(TunerInternal pid=1090302)                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
(TunerInternal pid=1090302)     raise value.as_instanceof_cause()
(TunerInternal pid=1090302) ray.exceptions.RayTaskError(RuntimeError): ray::ImplicitFunc.train() (pid=27628, ip=100.72.122.209, actor_id=815ce62aee0f4e028b4e71ca6e000000, repr=train_breast_cancer)
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 331, in train
(TunerInternal pid=1090302)     raise skipped from exception_cause(skipped)
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/air/_internal/util.py", line 98, in run
(TunerInternal pid=1090302)     self._ret = self._target(*self._args, **self._kwargs)
(TunerInternal pid=1090302)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
(TunerInternal pid=1090302)     training_func=lambda: self._trainable_func(self.config),
(TunerInternal pid=1090302)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
(TunerInternal pid=1090302)     output = fn()
(TunerInternal pid=1090302)              ^^^^
(TunerInternal pid=1090302)   File "/tmp/ipykernel_4006599/981054078.py", line 18, in train_breast_cancer
(TunerInternal pid=1090302)   File "/tmp/ray/session_2024-06-03_09-10-38_778444_1/runtime_resources/pip/e03ea1ab0b9620575b4404e353f290db608a5520/virtualenv/lib/python3.11/site-packages/lightgbm/engine.py", line 286, in train
(TunerInternal pid=1090302)     cb(callback.CallbackEnv(model=booster,
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/lightgbm/_lightgbm_utils.py", line 164, in __call__
(TunerInternal pid=1090302)     train.report(report_dict, checkpoint=checkpoint)
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 657, in wrapper
(TunerInternal pid=1090302)     return fn(*args, **kwargs)
(TunerInternal pid=1090302)            ^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 748, in report
(TunerInternal pid=1090302)     _get_session().report(metrics, checkpoint=checkpoint)
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/session.py", line 426, in report
(TunerInternal pid=1090302)     persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
(TunerInternal pid=1090302)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 539, in persist_current_checkpoint
(TunerInternal pid=1090302)     self._check_validation_file()
(TunerInternal pid=1090302)   File "/usr/local/lib/python3.11/site-packages/ray/train/_internal/storage.py", line 490, in _check_validation_file
(TunerInternal pid=1090302)     raise RuntimeError(
(TunerInternal pid=1090302) RuntimeError: Unable to set up cluster storage with the following settings:
(TunerInternal pid=1090302) StorageContext<
(TunerInternal pid=1090302)   storage_filesystem='local',
(TunerInternal pid=1090302)   storage_fs_path='/home/asu/ray_results',
(TunerInternal pid=1090302)   experiment_dir_name='train_breast_cancer_2024-06-06_07-09-00',
(TunerInternal pid=1090302)   trial_dir_name='train_breast_cancer_a636f_00003_3_boosting_type=dart,learning_rate=0.0057,num_leaves=88_2024-06-06_07-09-01',
(TunerInternal pid=1090302)   current_checkpoint_index=0,
(TunerInternal pid=1090302) >
(TunerInternal pid=1090302) Check that all nodes in the cluster have read/write access to the configured storage path. `RunConfig(storage_path)` should be set to a cloud storage URI or a shared filesystem path accessible by all nodes in your cluster ('s3://bucket' or '/mnt/nfs'). A local path on the head node is not accessible by worker nodes. See: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html
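
For reference, the shared-storage configuration that this error message asks for would look something like the sketch below (the bucket name and NFS path are placeholders of my own, not from the tutorial):

from ray import train

# storage_path must be readable and writable from every node in the cluster,
# e.g. a cloud storage URI or a shared filesystem mount.
run_config = train.RunConfig(
    storage_path="s3://my-bucket/ray_results",  # or e.g. "/mnt/nfs/ray_results"
    name="train_breast_cancer",
)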

Expected output

However, if I follow a simpler example (akin to https://docs.ray.io/en/latest/tune/examples/optuna_example.html) where the objective function explicitly calls ray.train.report after each training iteration rather than relying on a callback, this issue does not occur (see the second reproduction script below).

Furthermore, on a cluster with older versions (ray==2.9.3, kuberay==1.0.0), the LightGBM example above works fine.

Thus, with the newer versions there must be an issue with how ray.tune.integration.lightgbm.TuneReportCheckpointCallback saves to disk compared with calling ray.train.report directly, though I'm aware the former ultimately uses the latter under the hood. From the traceback, the callback calls train.report(report_dict, checkpoint=checkpoint), and persisting that checkpoint is what triggers the storage check; a metrics-only report presumably never touches the storage path, which would explain why the simpler example works.
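
As an illustration (a rough sketch only; the callback function below is my own and I have not run it on the cluster), a plain LightGBM callback that reports metrics without a checkpoint should sidestep the storage requirement, at the cost of losing checkpointing:

from ray import train

def report_metrics_only(env):
    # env is a lightgbm.callback.CallbackEnv; evaluation_result_list holds
    # (dataset_name, metric_name, value, is_higher_better) tuples
    metrics = {f"{data}-{name}": value for data, name, value, _ in env.evaluation_result_list}
    train.report(metrics)  # metrics only, no checkpoint, so cluster storage is never touched

# drop-in replacement in the reproduction script below:
# callbacks=[report_metrics_only]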

Attempted solutions

First, I tried providing an alternative (local) path via run_config=train.RunConfig(storage_path='/tmp/ray_results'), but this produces the same error, since /tmp on the head node is still not accessible to the worker nodes.

Next, I provided a storage location in an S3 bucket. However, I find that for LightGBM, given the high number of boosting rounds, number of trials, etc., this is prohibitively slow, as the workers need to write checkpoints to the external storage very frequently.
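
One possible mitigation, not tested: checkpoint only every N boosting rounds instead of every round. This assumes the callback exposes a frequency argument in this release, which should be verified against the API reference for your Ray version:

from ray.tune.integration.lightgbm import TuneReportCheckpointCallback

# Assumption: `frequency` controls how often (in boosting rounds) a checkpoint
# is written to storage; verify this against the Ray API docs for 2.20.0.
callback = TuneReportCheckpointCallback(
    {
        "binary_error": "eval-binary_error",
        "binary_logloss": "eval-binary_logloss",
    },
    frequency=50,
)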

A third alternative, not yet fully explored, is to perform something akin to a partial_fit with LightGBM, so that one can exit the fit after each "training_iteration" and manually report the results using ray.train.report, as in the second example outlined above:

"""
pseudocode, inspired by lightGBM example
"""
estimator = None
for i in num_boosting_rounds:
    estimator = lgb.train(..., num_boost_round=1, init_model=estimator)
    ray.train.report({"training_iteration": i, "binary_error": sklearn.metrics.accuracy_score(estimator.predict(test_x), test_y)})

However, this feels like an unnatural way to do things, and it does not really resolve the underlying issue.

Versions / Dependencies

python==3.11.5
ray==2.20.0 (doesn't work) ray==2.9.3 (works)
kuberay==1.1.0 (doesn't work) kuberay==1.0.0 (works)
lightgbm==4.3.0

Reproduction script

"""
lightGBM example, taken from: https://docs.ray.io/en/latest/tune/examples/lightgbm_example.html
"""
import lightgbm as lgb
import sklearn.datasets
from sklearn.model_selection import train_test_split
import ray
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.integration.lightgbm import TuneReportCheckpointCallback

ray.init(...) # connect to cluster here

def train_breast_cancer(config):

    data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
    train_set = lgb.Dataset(train_x, label=train_y)
    test_set = lgb.Dataset(test_x, label=test_y)
    gbm = lgb.train(
        config,
        train_set,
        valid_sets=[test_set],
        valid_names=["eval"],
        # perpetrator
        callbacks=[
            TuneReportCheckpointCallback(
                {
                    "binary_error": "eval-binary_error",
                    "binary_logloss": "eval-binary_logloss",
                }
            )
        ],
    )

if __name__ == "__main__":
    config = {
        "objective": "binary",
        "metric": ["binary_error", "binary_logloss"],
        "verbose": -1,
        "boosting_type": tune.grid_search(["gbdt", "dart"]),
        "num_leaves": tune.randint(10, 1000),
        "learning_rate": tune.loguniform(1e-8, 1e-1),
    }

    tuner = tune.Tuner(
        train_breast_cancer,
        tune_config=tune.TuneConfig(
            metric="binary_error",
            mode="min",
            scheduler=ASHAScheduler(),
            num_samples=2,
        ),
        param_space=config,
        run_config=train.RunConfig(storage_path='/tmp/ray_results') # change local dir being saved to
        # run_config=train.RunConfig(storage_path='s3:...') # s3 storage attempt

    )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)
"""
trivial objective function example, inspired by: https://docs.ray.io/en/latest/tune/examples/optuna_example.html
"""

import time

import ray
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler

ray.init(...) # connect to cluster here

def evaluate(step, width, height, activation):
    time.sleep(0.1)
    activation_boost = 10 if activation=="relu" else 0
    return (0.1 + width * step / 1000) ** (-1) + height * 0.1 + activation_boost

def objective(config):
    for step in range(config["steps"]):
        score = evaluate(step, config["width"], config["height"], config["activation"])
        train.report({"iterations": step, "mean_loss": score})

search_space = {
    "steps": 100,
    "width": tune.uniform(0, 20),
    "height": tune.uniform(-100, 100),
    "activation": tune.choice(["relu", "tanh"]),
}

tuner = tune.Tuner(
    objective,
    tune_config=tune.TuneConfig(
        metric="mean_loss",
        mode="min",
        num_samples=2,
        scheduler=ASHAScheduler(time_attr="iterations"),
    ),
    param_space=search_space,
)
results = tuner.fit()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

SuperKam91 commented 2 weeks ago

I tried using LightGBM-Ray as an alternative, but it seems to have its own issues with ray==2.20.0.