ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[ml release] `air_benchmark_xgboost_cpu_10` is failing due to memory issues #31068

Closed: Yard1 closed this issue 1 year ago

Yard1 commented 1 year ago

What happened + What you expected to happen

air_benchmark_xgboost_cpu_10 fails with:

(raylet, ip=172.31.199.222) Spilled 3667 MiB, 16 objects, write throughput 1040 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.160.92) Spilled 2303 MiB, 10 objects, write throughput 737 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.247.167) Spilled 3753 MiB, 17 objects, write throughput 838 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.129.182) Spilled 2558 MiB, 12 objects, write throughput 643 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.205.18) Spilled 4520 MiB, 20 objects, write throughput 937 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.182.158) Spilled 2985 MiB, 13 objects, write throughput 613 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.199.222) Spilled 5373 MiB, 24 objects, write throughput 1151 MiB/s.
(raylet, ip=172.31.182.205) Spilled 2132 MiB, 9 objects, write throughput 526 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.247.167) Spilled 5544 MiB, 25 objects, write throughput 906 MiB/s.
(raylet, ip=172.31.182.158) Spilled 6141 MiB, 26 objects, write throughput 934 MiB/s.
(raylet, ip=172.31.205.18) Spilled 6056 MiB, 27 objects, write throughput 819 MiB/s.
(raylet, ip=172.31.162.203) Spilled 3753 MiB, 16 objects, write throughput 930 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.162.203) Spilled 4862 MiB, 21 objects, write throughput 954 MiB/s.
(raylet, ip=172.31.129.182) Spilled 11430 MiB, 49 objects, write throughput 799 MiB/s.
(raylet, ip=172.31.205.18) Spilled 13818 MiB, 58 objects, write throughput 885 MiB/s.
(raylet, ip=172.31.160.92) Spilled 11345 MiB, 47 objects, write throughput 754 MiB/s.
(raylet, ip=172.31.247.167) Spilled 14672 MiB, 62 objects, write throughput 847 MiB/s.
(raylet, ip=172.31.227.81) Spilled 11601 MiB, 48 objects, write throughput 710 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.182.205) Spilled 12624 MiB, 54 objects, write throughput 718 MiB/s.
(raylet, ip=172.31.227.230) Spilled 12795 MiB, 53 objects, write throughput 730 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=172.31.182.158) Spilled 15013 MiB, 62 objects, write throughput 789 MiB/s.
(raylet, ip=172.31.162.203) Spilled 8274 MiB, 36 objects, write throughput 659 MiB/s.
(raylet, ip=172.31.199.222) Spilled 15951 MiB, 68 objects, write throughput 808 MiB/s.
(raylet, ip=172.31.162.203) Spilled 16804 MiB, 71 objects, write throughput 846 MiB/s.
2022-12-12 15:03:14,340 ERROR trial_runner.py:1095 -- Trial XGBoostTrainer_c44dd_00000: Error processing event.
ray.exceptions.RayTaskError(MemoryError): ray::_Inner.train() (pid=469, ip=172.31.162.203, repr=XGBoostTrainer)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable/function_trainable.py", line 338, in entrypoint
    self._status_reporter.get_checkpoint(),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 480, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 390, in train_func
    trainer.training_loop()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/gbdt_trainer.py", line 298, in training_loop
    **config,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/xgboost/xgboost_trainer.py", line 84, in _train
    return xgboost_ray.train(**kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost_ray/main.py", line 1414, in train
    dtrain.load_data(ray_params.num_actors)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 819, in load_data
    self.num_actors, self.sharding, rank=rank)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 382, in load_data
    self.data, ignore=self.ignore, indices=None, **self.kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost_ray/data_sources/ray_dataset.py", line 68, in load_data
    return ObjectStore.load_data(obj_refs, ignore=ignore, indices=indices)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/xgboost_ray/data_sources/object_store.py", line 32, in load_data
    return Pandas.load_data(pd.concat(local_df, copy=False), ignore=ignore)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 533, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/internals/concat.py", line 216, in concatenate_managers
    values = np.concatenate(vals, axis=blk.ndim - 1)
  File "<__array_function__ internals>", line 6, in concatenate
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 77.5 GiB for an array with shape (40, 260000000) and data type float64

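The final `_ArrayMemoryError`, raised from `np.concatenate` inside `pd.concat` while xgboost_ray loads the dataset shards, suggests the node ran out of memory when trying to allocate a single 77.5 GiB array. This could be a regression.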
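As a sanity check on the number in the traceback, the requested allocation follows directly from the reported shape and dtype. The snippet below is a minimal sketch of that arithmetic only; the helper name `array_size_gib` is made up for illustration and is not part of Ray, xgboost_ray, pandas, or NumPy.

```python
# Reproduce the allocation size reported by numpy's _ArrayMemoryError.
# array_size_gib is a hypothetical helper written for this illustration.

FLOAT64_BYTES = 8  # a float64 element occupies 8 bytes


def array_size_gib(shape, itemsize_bytes):
    """Return the size in GiB of a dense array with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * itemsize_bytes / 2**30


# Shape and dtype taken from the error message above.
size = array_size_gib((40, 260_000_000), FLOAT64_BYTES)
print(f"{size:.1f} GiB")  # prints "77.5 GiB"
```

So the failing call is asking for one contiguous buffer holding all 40 columns for 260 million rows, which matches the 77.5 GiB figure in the error.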

Versions / Dependencies

master

Reproduction script

https://buildkite.com/ray-project/release-tests-branch/builds/1251#0185086b-9aec-429b-9455-13cf78e5c0db

Issue Severity

None

Yard1 commented 1 year ago

A secondary issue that may be fixed alongside this one: https://github.com/ray-project/ray/issues/30790