[FEA] dask train test split

gumdropsteve commented 4 years ago

Is your feature request related to a problem? Please describe.

I wish I could use cuML to train_test_split a dask_cudf.DataFrame.

Describe the solution you'd like

from cuml.dask.preprocessing.model_selection import train_test_split

https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html

Describe alternatives you've considered

Running normal train_test_split, then making dask_cudf.DataFrames from the results.

JohnZed commented 4 years ago

Thank you for submitting this! train_test_split for Dask is certainly a logical feature for cuML to add and is on our priority list for 0.15. Adding this issue to that release target.

cjnolet commented 4 years ago

In the meantime, if your data is already shuffled randomly across the cluster, Dask’s random_split function works with dask_cudf.

Here’s an example of where this is used on TPCxBB: https://github.com/rapidsai/tpcx-bb/blob/master/tpcx_bb/queries/q28/tpcx_bb_query_28.py#L339

pseudotensor commented 4 years ago

Yes, I would like this feature as well; it would simplify using cuML's scikit-learn-like functions. Currently I hit:

2020-10-21 19:45:33,533 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   |   File "/home/jon/miniconda/lib/python3.6/site-packages/cuml/preprocessing/model_selection.py", line 130, in train_test_split
2020-10-21 19:45:33,534 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   |     if X.shape[0] != y.shape[0]:
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   |   File "/home/jon/miniconda/lib/python3.6/site-packages/dask/delayed.py", line 563, in __bool__
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   |     raise TypeError("Truth of Delayed objects is not supported")
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   | TypeError: Truth of Delayed objects is not supported
2020-10-21 19:45:33,536 C: 25% D:367.0GB M:47.6GB  NODE:LOCAL1      4613   DATA   | ].
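The traceback suggests that `X.shape[0]` was a lazy Delayed value, so the `if` statement in the non-Dask cuml train_test_split tried to evaluate it as a boolean. A small sketch that reproduces the same error with dask.delayed directly; the list inputs and variable names are hypothetical stand-ins for the Dask-backed X and y:

```python
from dask import delayed

# Hypothetical stand-ins for X.shape[0] and y.shape[0] on Dask-backed data:
# both are lazy values that have not been computed yet.
n_rows_X = delayed(len)([1, 2, 3])
n_rows_y = delayed(len)([4, 5, 6])

# Comparing two Delayed objects builds another Delayed, not a bool.
mismatch = n_rows_X != n_rows_y

try:
    if mismatch:  # calls Delayed.__bool__, which raises
        pass
except TypeError as exc:
    message = str(exc)

# Computing the value explicitly gives a concrete result that `if` can use.
rows_differ = bool(mismatch.compute())
```

This is why passing Dask collections to the single-process function fails before any splitting happens: the shape check needs concrete row counts.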

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapidsai / cuml

[FEA] dask train test split #2374