zcarrico-fn opened 10 months ago
Far from an expert on the Ray library, but perhaps it's possible to use Ray Tune, with a hyperparameter specifying which i-th fold to use?
i.e., the same data is sent to each Tune run, but each run shuffles the data in exactly the same order (the identical ordering is required within each run so that every sample lands in the validation split exactly once across runs), and the Ray Tune param_space identifies which i-th split each run should use as its validation split. A sketch of this is below.
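A minimal sketch of that idea, assuming a Ray Tune function trainable; `load_data` and `fit_and_eval` are hypothetical placeholders, not part of any library:

```python
import numpy as np
from ray import tune

N_FOLDS = 5

def train_fold(config):
    # load_data() and fit_and_eval() are hypothetical helpers.
    X, y = load_data()
    # Every trial uses the same seed, so the permutation -- and therefore
    # the fold assignment -- is identical across trials; only
    # config["fold"] decides which chunk serves as the validation split.
    order = np.random.default_rng(seed=42).permutation(len(X))
    folds = np.array_split(order, N_FOLDS)
    val_idx = folds[config["fold"]]
    train_idx = np.concatenate(
        [f for i, f in enumerate(folds) if i != config["fold"]]
    )
    score = fit_and_eval(X[train_idx], y[train_idx], X[val_idx], y[val_idx])
    # Returning a dict from a function trainable reports it as the
    # trial's final result.
    return {"val_score": score}

# grid_search over the fold index launches one trial per fold.
tuner = tune.Tuner(
    train_fold,
    param_space={"fold": tune.grid_search(list(range(N_FOLDS)))},
)
results = tuner.fit()
```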
Thank you @Anner-deJong, and this works with customizable training functions, but xgboost_ray isn't customizable beyond its argument inputs as far as I know. In other words, the fold index could be passed as a hyperparameter, but there's no way to pass a custom callable to xgboost_ray so that it selects data based on that fold value.
For cross-validation, we are attempting to parallelize xgboost_ray.train using ray.remote tasks, with each remote task using a different cross-validation split of the data. Unfortunately, parallelizing xgboost_ray.train results in the errors below; if the same tasks are run sequentially rather than in parallel, no errors occur. Below is a reproducible example based on the example in xgboost_ray's documentation. If it is run locally, it completes; it's only when parallelized on a remote Ray cluster that it produces the errors below.
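A sketch of the failing pattern, adapted from the documentation's example (the KFold setup, actor counts, and metric here are illustrative, not the exact reproducer):

```python
import ray
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from xgboost_ray import RayDMatrix, RayParams, train

@ray.remote
def train_fold(train_idx, val_idx):
    # Each remote task trains on its own CV split via xgboost_ray.train,
    # which in turn spawns its own Ray training actors.
    X, y = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(X[train_idx], y[train_idx])
    dval = RayDMatrix(X[val_idx], y[val_idx])
    evals_result = {}
    train(
        {"objective": "binary:logistic", "eval_metric": ["logloss"]},
        dtrain,
        evals=[(dval, "val")],
        evals_result=evals_result,
        verbose_eval=False,
        ray_params=RayParams(num_actors=2, cpus_per_actor=1),
    )
    return evals_result["val"]["logloss"][-1]

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Launching the fold tasks in parallel is what triggers the errors on a
# remote cluster; running them one at a time does not.
scores = ray.get([train_fold.remote(tr, va) for tr, va in kf.split(X)])
```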
Traceback:
I expect the same error will be encountered when parallelizing remote tasks for nested cross-validation during HPO.
Please let me know if you have any questions, and thank you for the help and for the great xgboost_ray library!