ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Train incompatible with XGBoost 2.1.0 #46476

Closed Lothiraldan closed 3 months ago

Lothiraldan commented 4 months ago

What happened + What you expected to happen

Hi, I'm maintaining some examples of Ray with Comet and our automated CI detected a failure in the following Ray + XGBoost example: https://github.com/comet-ml/comet-examples/blob/master/integrations/model-training/ray-train/notebooks/Comet_with_ray_train_xgboost.ipynb

It used to run fine but recently started failing with the following traceback:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/_private/worker.py", line 2639, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::_Inner.train() (pid=2292, ip=10.1.0.171, actor_id=db308eb6c13947a5e1d854cf01000000, repr=XGBoostTrainer)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 248, in _trainable_func
    output = fn()
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/data_parallel_trainer.py", line 460, in training_loop
    training_iterator = self._training_iterator_cls(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/trainer.py", line 51, in __init__
    self._start_training(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/trainer.py", line 76, in _start_training
    self._run_with_error_handling(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
    return func()
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/trainer.py", line 77, in <lambda>
    lambda: self._backend_executor.start_training(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/_internal/backend_executor.py", line 540, in start_training
    self._backend.on_training_start(self.worker_group, self._backend_config)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/ray/train/xgboost/config.py", line 74, in on_training_start
    rabit_args.update(self._tracker.worker_envs())
AttributeError: 'RabitTracker' object has no attribute 'worker_envs'

I tracked it down to a recent change in XGBoost 2.1.0 that removed RabitTracker.worker_envs, a method that Ray's XGBoost integration was using. The change was made in the following commit: https://github.com/dmlc/xgboost/commit/a5a58102e5e82fa508514c34cd8e5f408dcfd3e1#diff-bb6f193f9b5b43834b65f777b7c21d33440618e006c965f2e962b17ec9100444
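A quick way to confirm which side of this change an environment is on is to probe the tracker class for the method. This is only a minimal sketch: the helper name `has_worker_envs` is hypothetical, and the demo assumes `RabitTracker` can be imported from `xgboost.tracker`.

```python
def has_worker_envs(tracker_cls) -> bool:
    """Return True if the given tracker class still exposes a callable `worker_envs`."""
    return callable(getattr(tracker_cls, "worker_envs", None))


if __name__ == "__main__":
    try:
        # Assumption: RabitTracker lives in xgboost.tracker in the installed version.
        from xgboost.tracker import RabitTracker
    except ImportError:
        print("xgboost (or xgboost.tracker.RabitTracker) is not available")
    else:
        print("RabitTracker.worker_envs available:", has_worker_envs(RabitTracker))
```

If this reports that the method is not available, a Ray version that still calls `worker_envs` would be expected to fail with the AttributeError shown in the traceback.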

On my side, I will pin an older version of xgboost for now, but I didn't find an open bug for this, so I preferred to report it.

Versions / Dependencies

Python version: 3.10.12

Pip freeze:

aiohttp==3.9.5
aiohttp-cors==0.7.0
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.3.3
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
colorful==0.5.6
comm==0.2.2
contourpy==1.2.1
cycler==0.12.1
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
distlib==0.3.8
dnspython==2.6.1
email_validator==2.2.0
envdir==1.0.1
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.20.0
filelock==3.15.4
fonttools==4.53.1
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.6.1
google-api-core==2.19.1
google-auth==2.31.0
googleapis-common-protos==1.63.2
grpcio==1.64.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
idna==3.7
ipykernel==6.29.5
ipython==8.26.0
ipywidgets==8.1.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.1
jupyter_server_terminals==0.5.3
jupyterlab==4.2.3
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.11
kiwisolver==1.4.5
lab==8.2
linkify-it-py==2.0.3
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.1
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.1
mdurl==0.1.2
memray==1.13.3
mistune==3.0.2
msgpack==1.0.8
multidict==6.0.5
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
notebook==7.2.1
notebook_shim==0.2.4
numpy==2.0.0
nvidia-nccl-cu12==2.22.3
opencensus==0.11.4
opencensus-context==0.1.3
orjson==3.10.6
overrides==7.7.0
packaging==24.1
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==10.4.0
platformdirs==4.2.2
prometheus_client==0.20.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.27.2
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.2
py-spy==0.3.14
pyarrow==16.1.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
qtconsole==5.5.2
QtPy==2.4.1
ray==2.31.0
referencing==0.35.1
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rpds-py==0.19.0
rsa==4.9
scipy==1.14.0
Send2Trash==1.8.3
shellingham==1.5.4
simplejson==3.19.2
six==1.16.0
smart-open==7.0.4
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
tensorboardX==2.6.2.2
terminado==0.18.1
textual==0.71.0
tinycss2==1.3.0
tomli==2.0.1
tornado==6.4.1
traitlets==5.14.3
txt2tags==3.9
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.2
tzdata==2024.1
uc-micro-py==1.0.3
ujson==5.10.0
uri-template==1.3.0
urllib3==2.2.2
uvicorn==0.30.1
uvloop==0.19.0
virtualenv==20.26.3
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==24.6.0
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
wrapt==1.16.0
xgboost==2.1.0
xgboost-ray==0.1.19
yarl==1.9.4

Reproduction script

import os

import ray
from ray.air.config import RunConfig, ScalingConfig
from ray.train import Result
from ray.train.xgboost import XGBoostTrainer

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(
    test_size=0.3, shuffle=True, seed=536
)

def train_xgboost(
    num_workers: int = 2, use_gpu: bool = False, num_boost_round: int = 20
) -> Result:
    config = {}

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(
            # Number of workers to use for data parallelism.
            num_workers=num_workers,
            # Whether to use GPU acceleration. Set to True to schedule GPU workers.
            use_gpu=use_gpu,
        ),
        label_column="target",
        num_boost_round=num_boost_round,
        params={
            # XGBoost specific params (see the `xgboost.train` API reference)
            "objective": "binary:logistic",
            # uncomment this and set `use_gpu=True` to use GPU for training
            # "tree_method": "gpu_hist",
            "eval_metric": ["logloss", "error"],
            # Make the build reproducible
            "random_state": 536,
        },
        datasets={"train": train_dataset, "valid": valid_dataset},
    )
    result = trainer.fit()
    return result

ideal_num_workers = 2

available_local_cpu_count = os.cpu_count() - 1
num_workers = min(ideal_num_workers, available_local_cpu_count)

if num_workers < 1:
    num_workers = 1

train_xgboost(num_workers, use_gpu=False, num_boost_round=10)

Issue Severity

Low: It annoys or frustrates me.

vladosby commented 4 months ago

I have the same issue with Ray 2.32.0 and Python 3.11.9.

anyscalesam commented 4 months ago

thanks - we'll triage today.

Spin8Cycle commented 2 months ago

I also have the same issue: AttributeError: 'RabitTracker' object has no attribute 'worker_envs'

I downgraded to xgboost 2.1.0 but I'm still having the same error.

yenicelik commented 1 month ago

Getting the same issue, running Python 3.10.14 and the following requirements (trimmed down):

kaleido==0.2.1
kedro==0.19.8
kedro-datasets==4.1.0
kedro-telemetry==0.6.0
kedro-viz==10.0.0
modin==0.26.1
notebook==7.2.2
notebook_shim==0.2.4
numba==0.60.0
numpy==1.26.4
omegaconf==2.3.0
ray==2.34.0
scikit-base==0.7.8
scikit-learn==1.4.2
scikit-plot==0.3.7
scipy==1.11.4
seaborn==0.12.2
wandb==0.18.1
xgboost==2.1.0

justinvyu commented 1 month ago

@Spin8Cycle @yenicelik Can you try upgrading to ray>=2.35.0?

That's the first version with xgboost 2.1.x support, which was added in this PR: https://github.com/ray-project/ray/pull/46667
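For anyone who wants a guard in CI, the version constraint described in this thread can be sketched as a small check. This is an illustration, not part of Ray: the helper names `parse_version` and `is_affected` are hypothetical, and the thresholds (xgboost 2.1.0 or later combined with ray older than 2.35.0) are taken only from the comments above.

```python
import re


def parse_version(version: str) -> tuple:
    """Parse the leading numeric 'X.Y.Z' part of a version string into a tuple of ints."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # Missing components (e.g. "2.1") are treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_affected(ray_version: str, xgboost_version: str) -> bool:
    """Return True for the combination reported in this thread:
    xgboost >= 2.1.0 together with ray < 2.35.0 (assumed thresholds)."""
    return (
        parse_version(xgboost_version) >= (2, 1, 0)
        and parse_version(ray_version) < (2, 35, 0)
    )


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        print("affected:", is_affected(version("ray"), version("xgboost")))
    except PackageNotFoundError as exc:
        print(f"package not installed: {exc}")
```

With the versions from the original report (ray 2.31.0, xgboost 2.1.0) the check flags the environment; it stops flagging it once either ray is upgraded to 2.35.0 or xgboost is pinned below 2.1.0.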