**Problem:**
After bumping to `modin 0.10.0`, `xgboost_ray` no longer recognizes `modin`'s `DataFrame`. Calling `train` on a `modin` `DataFrame` results in the following error:
2021-06-16 21:55:57,585 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'NoneType'> object. This may take some time.
Traceback (most recent call last):
File "test.py", line 11, in <module>
train({}, matrix)
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 1196, in train
dtrain.load_data(ray_params.num_actors)
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 741, in load_data
refs, self.n = self.loader.load_data(
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 317, in load_data
data_source = self.get_data_source()
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 278, in get_data_source
raise ValueError(
ValueError: Unknown data source type: <class 'modin.pandas.dataframe.DataFrame'> with FileType: None.
FIX THIS by passing a supported data type. Supported data types include pandas.DataFrame, pandas.Series, np.ndarray, and CSV/Parquet file paths. If you specify a file, path, consider passing the `filetype` argument to specify the type of the source. Use the `RayFileType` enum for that.
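When comparing the two modin versions, it can help to print the fully qualified class name of the object being passed to `RayDMatrix`, since the error above reports the type as `modin.pandas.dataframe.DataFrame`. The following is a minimal diagnostic sketch using only the standard library. `qualified_type_name` and `type_lineage` are hypothetical helpers, not part of modin or xgboost_ray, and the sketch makes no claim about how xgboost_ray actually detects modin objects.

```python
def qualified_type_name(obj):
    """Return the dotted module path and class name of obj's type."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def type_lineage(obj):
    """Return qualified names for every class in the type's MRO."""
    return [f"{c.__module__}.{c.__qualname__}" for c in type(obj).__mro__]


if __name__ == "__main__":
    # Replace the sample object with the modin DataFrame from the script.
    sample = {}
    print(qualified_type_name(sample))
    print(type_lineage(sample))
```

Running this under both modin versions and comparing the output would show whether the class moved to a different module path between releases.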
**Steps to Reproduce:**
1. Install `modin 0.10.0`: `pip install -U modin==0.10.0`
   a. This also happens with the latest commit: `pip install -U git+https://github.com/modin-project/modin`
2. Run the following Python script:
import modin.pandas as pd
import ray
from xgboost_ray import RayDMatrix, train

ray.init()
df = pd.DataFrame()
matrix = RayDMatrix(df, label="label")
train({}, matrix)
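Because the behavior depends on the installed modin version, it may be worth recording the versions actually present in the environment before running the script. Below is a minimal sketch using `importlib.metadata` from the standard library (Python 3.8+). The `installed_versions` helper is hypothetical, and the distribution names passed to it are assumed to match the names used at install time.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name, ver in installed_versions(["modin", "ray", "xgboost_ray"]).items():
        print(f"{name}: {ver or 'not installed'}")
```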
**Reference:**
This issue does not happen with the previous version, `modin 0.9.1`.
1. Downgrade to `modin 0.9.1`: `pip uninstall -y modin && pip install modin==0.9.1`
2. Running the same script above successfully loads the data (it fails later in the `train` method due to the simplicity of the script).
2021-06-16 21:54:54,627 INFO services.py:1315 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'NoneType'> object. This may take some time.
2021-06-16 21:54:56,290 INFO main.py:853 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
Traceback (most recent call last):
File "test.py", line 11, in <module>
train({}, matrix)
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 1267, in train
bst, train_evals_result, train_additional_results = _train(
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/main.py", line 859, in _train
dtrain.assert_enough_shards_for_actors(num_actors=ray_params.num_actors)
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 711, in assert_enough_shards_for_actors
self.loader.assert_enough_shards_for_actors(num_actors=num_actors)
File "/Users/matt/anaconda3/envs/ray/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 433, in assert_enough_shards_for_actors
raise RuntimeError(
RuntimeError: Trying to shard data for 4 actors, but the maximum number of shards is 0. If you want to shard the dataset by rows, consider centralized loading by passing distributed=False to the RayDMatrix. Otherwise consider using fewer actors or re-partitioning your data.
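The `RuntimeError` in the `modin 0.9.1` run is the later failure mentioned above: the log reports 0 available shards for 4 actors, presumably because the script's data is trivially small. The condition the message describes can be sketched as follows. This illustrates the arithmetic only; `check_shards` is a hypothetical function and is not xgboost_ray's implementation.

```python
def check_shards(num_shards, num_actors):
    """Raise if there are fewer shards than actors, mirroring the message above."""
    if num_actors > num_shards:
        raise RuntimeError(
            f"Trying to shard data for {num_actors} actors, but the maximum "
            f"number of shards is {num_shards}."
        )


if __name__ == "__main__":
    check_shards(num_shards=4, num_actors=4)  # enough shards, no error
    try:
        check_shards(num_shards=0, num_actors=4)  # the case in the log above
    except RuntimeError as err:
        print(err)
```

As the error text itself suggests, passing `distributed=False` to the `RayDMatrix` or using fewer actors are the remedies to consider for that later failure.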