ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Incompatibility with Modin 0.10.0 DataFrame #116

Closed matthewdeng closed 3 years ago

matthewdeng commented 3 years ago

Problem: After bumping to modin 0.10.0, xgboost_ray no longer recognizes modin's DataFrame. Calling train on a modin DataFrame results in the following error:

2021-06-16 21:55:57,585 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'NoneType'> object. This may take some time.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    train({}, matrix)
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 1196, in train
    dtrain.load_data(ray_params.num_actors)
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 741, in load_data
    refs, self.n = self.loader.load_data(
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 317, in load_data
    data_source = self.get_data_source()
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 278, in get_data_source
    raise ValueError(
ValueError: Unknown data source type: <class 'modin.pandas.dataframe.DataFrame'> with FileType: None.
FIX THIS by passing a supported data type. Supported data types include pandas.DataFrame, pandas.Series, np.ndarray, and CSV/Parquet file paths. If you specify a file, path, consider passing the `filetype` argument to specify the type of the source. Use the `RayFileType` enum for that.
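Possible stopgap: the error message itself suggests passing a supported type such as pandas.DataFrame. The sketch below follows that hint. It assumes the modin DataFrame exposes the private `_to_pandas()` method, which is not part of modin's public API and may change between releases. The helper name `to_supported_frame` is hypothetical. Note that this materializes the whole frame on the driver, so it gives up distributed loading and is probably only viable for data that fits in memory.

```python
def to_supported_frame(df):
    """Return a plain pandas object when `df` offers a `_to_pandas()` hook.

    Assumption: modin DataFrames expose a private `_to_pandas()` method.
    Anything without that hook (pandas objects, arrays, paths) is returned
    unchanged.
    """
    convert = getattr(df, "_to_pandas", None)
    if callable(convert):
        # Collects all partitions into a single in-memory pandas object.
        return convert()
    return df
```

With this helper, the failing call would become `RayDMatrix(to_supported_frame(df), label="label")`.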

Steps to Reproduce:

  1. Install modin 0.10.0: pip install -U modin==0.10.0
     a. This also happens with the latest commit: pip install -U git+https://github.com/modin-project/modin
  2. Run the following python script:
    
    import modin.pandas as pd
    import ray
    from xgboost_ray import RayDMatrix, train

    ray.init()
    df = pd.DataFrame()
    matrix = RayDMatrix(df, label="label")
    train({}, matrix)
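Since the behavior depends on which modin release is installed, it may help to confirm the versions in the active environment before and after switching. A minimal sketch using only the standard library follows. The helper name `installed_versions` is hypothetical, and the distribution names `modin`, `ray`, and `xgboost_ray` are assumed to match what pip installed.

```python
from importlib import metadata


def installed_versions(packages=("modin", "ray", "xgboost_ray")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once per environment gives a quick record of which modin version each traceback was produced with.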


**Reference:**
This issue does not occur with the previous release, `modin 0.9.1`.

1. Downgrade to `modin 0.9.1`: `pip uninstall -y modin && pip install modin==0.9.1`
2. Run the same script as above. The data now loads successfully (the run fails later, in the `train` method, only because the script is so minimal):

2021-06-16 21:54:54,627 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'NoneType'> object. This may take some time.
2021-06-16 21:54:56,290 INFO main.py:853 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    train({}, matrix)
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 1267, in train
    bst, train_evals_result, train_additional_results = _train(
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 859, in _train
    dtrain.assert_enough_shards_for_actors(num_actors=ray_params.num_actors)
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 711, in assert_enough_shards_for_actors
    self.loader.assert_enough_shards_for_actors(num_actors=num_actors)
  File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 433, in assert_enough_shards_for_actors
    raise RuntimeError(
RuntimeError: Trying to shard data for 4 actors, but the maximum number of shards is 0. If you want to shard the dataset by rows, consider centralized loading by passing distributed=False to the RayDMatrix. Otherwise consider using fewer actors or re-partitioning your data.