Open gumdropsteve opened 4 years ago
Thank you for submitting this! train_test_split for Dask is certainly a logical feature for cuML to add and is on our priority list for 0.15. Adding this issue to that release target.
In the meantime, if your data is already shuffled randomly across the cluster, Dask’s random_split function does work with dask_cudf.
Here’s an example of where this is used on TPCxBB: https://github.com/rapidsai/tpcx-bb/blob/master/tpcx_bb/queries/q28/tpcx_bb_query_28.py#L339
Yes, I would like this feature as well; it would simplify using cuML's scikit-learn-like functions with Dask. Currently I hit:
2020-10-21 19:45:33,533 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | File "/home/jon/miniconda/lib/python3.6/site-packages/cuml/preprocessing/model_selection.py", line 130, in train_test_split
2020-10-21 19:45:33,534 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | if X.shape[0] != y.shape[0]:
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | File "/home/jon/miniconda/lib/python3.6/site-packages/dask/delayed.py", line 563, in __bool__
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | raise TypeError("Truth of Delayed objects is not supported")
2020-10-21 19:45:33,535 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | TypeError: Truth of Delayed objects is not supported
2020-10-21 19:45:33,536 C: 25% D:367.0GB M:47.6GB NODE:LOCAL1 4613 DATA | ].
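The trace above ends in Dask's Delayed.__bool__, which suggests that the shape comparison inside train_test_split produced a lazy value rather than a concrete boolean. The sketch below reproduces that error in isolation with dask.delayed; the variable names are hypothetical, and it only illustrates the mechanism, not cuML's actual code path.

```python
from dask import delayed

# Two lazy row counts, standing in for X.shape[0] and y.shape[0]
# on Dask-backed inputs (an assumption about what the trace shows).
n_x = delayed(len)([1, 2, 3])
n_y = delayed(len)([4, 5, 6])

# Comparing Delayed objects yields another Delayed, not a bool.
mismatch = n_x != n_y

try:
    if mismatch:  # calls Delayed.__bool__, which raises
        pass
except TypeError as err:
    message = str(err)

# Computing the comparison first gives a real boolean that `if` accepts.
rows_differ = mismatch.compute()
print(message, rows_differ)
```

Any length check on Dask inputs therefore has to be computed explicitly before it can be used in an if statement.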
Is your feature request related to a problem? Please describe.
I wish I could use cuML to train_test_split a dask_cudf.DataFrame.
Describe the solution you'd like
https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html
Describe alternatives you've considered
Running normal train_test_split then making dask_cudf.DataFrames from the results.