rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.25k stars 534 forks

[FEA] Add Cross Validators to cuml #4662

Open tanaymeh opened 2 years ago

tanaymeh commented 2 years ago

Is your feature request related to a problem? Please describe. I would really love to see Cross validators such as KFold, StratifiedKFold, GroupKFold, etc in cuml. It will help make RAPIDS data science pipelines more independent of scikit-learn.

Describe the solution you'd like I would like to add the following in the first iteration (since there are many cross-validators in scikit-learn): KFold, StratifiedKFold, and GroupKFold.

Describe alternatives you've considered I am currently not aware of any alternatives for using the above cross-validators natively in cuml.

divyegala commented 2 years ago

@heytanay you should be able to use scikit-learn's cross validators directly with a cuML model. Can you try that and let me know if it works?

tanaymeh commented 2 years ago

> @heytanay you should be able to use scikit-learn's cross validators directly with a cuML model. Can you try that and let me know if it works?

Hi, if I pass a cuDF dataframe to a scikit-learn cross validator, I get an "Implicit conversion to NumPy array" error. Here is an example snippet:

import cudf
from sklearn.model_selection import StratifiedKFold

df = cudf.read_csv("train.csv")
X = df.drop(['target'], axis=1)

kfold = StratifiedKFold(n_splits=5)
for train_idx, valid_idx in kfold.split(X=X, y=df['target']):
    print(train_idx, valid_idx)

Error:

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()

To get around this, I believe I will have to convert the cuDF dataframe to a pandas dataframe, which would be impractical for large tabular files.
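One workaround that avoids copying the whole frame (a sketch, not from the thread; the cuDF-specific calls appear only in comments since they need a GPU): scikit-learn splitters return plain NumPy index arrays, so only the label column has to move to the host, and the GPU frame can then be sliced by position.

```python
import numpy as np

# With cuDF this would be (illustrative, requires a GPU):
#   y_host = df['target'].to_numpy()        # copies only the label column
#   for train_idx, valid_idx in kfold.split(np.zeros(len(df)), y_host):
#       X_train = X.iloc[train_idx]         # slicing stays on the GPU
#
# Here plain NumPy stands in for the GPU objects so the sketch runs anywhere.
y_host = np.array([0, 1, 0, 1, 0, 1])
train_idx = np.array([0, 1, 2, 3])          # what a splitter would return
valid_idx = np.array([4, 5])
X = np.arange(12).reshape(6, 2)             # stand-in for the cuDF frame
X_train, X_valid = X[train_idx], X[valid_idx]
print(X_train.shape, X_valid.shape)
```

StratifiedKFold only inspects y for class balance and X for its length, which is why a placeholder array works in place of the real features.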

beckernick commented 2 years ago

Thanks for filing this feature request. This is a limitation of the current approach.

Often when the data transfer time is non-trivial, the estimator training time is still the bulk of the time spent. E.g., the time spent training a CPU model on a 3 GB dataset is typically much greater than the time spent transferring that 3 GB dataset, so cuML can still provide large speedups. Are you in a scenario in which this isn't the case? Would you be able to share a bit more information?

divyegala commented 2 years ago

@beckernick correct me if I am wrong, but shouldn't we be able to get away with using CuPy arrays, since they have the same mechanisms as NumPy arrays?

tanaymeh commented 2 years ago

@beckernick I don't have a specific dataset in mind, but there have been a few instances, both on and off Kaggle, where I was dealing with really large datasets and converting the data frame to a pandas data frame would not have been practical.

That's the reason I proposed implementing these cross-validators.

beckernick commented 2 years ago

@divyegala CuPy arrays will fail the internal _validate_data checks in scikit-learn that ultimately call down to np.asarray and run into the equivalent implicit conversion error. There's been some recent discussion about paths forward, though.
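The failure mode can be reproduced without a GPU. A minimal sketch (DeviceArrayMock is a stand-in I made up, not a real cuDF/CuPy type) of why np.asarray-based validation rejects device objects:

```python
import numpy as np

class DeviceArrayMock:
    """Stand-in for a CuPy/cuDF object: __array__ raises instead of
    silently copying device memory to the host."""
    def __array__(self, *args, **kwargs):
        raise TypeError(
            "Implicit conversion to a host NumPy array via __array__ "
            "is not allowed")

try:
    # sklearn's check_array / _validate_data ultimately reach np.asarray,
    # which invokes __array__ and propagates the error
    np.asarray(DeviceArrayMock())
except TypeError as err:
    print("validation would fail:", err)
```

This is exactly the guard that produces the error message quoted earlier in the thread: the device library refuses the implicit host copy rather than performing it behind the user's back.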

@heytanay thanks for the additional context. I agree that using GPU data structures in cross validators is a reasonable feature request. Just wanted to provide some color on the training time vs. transfer time impact.

beckernick commented 2 years ago

@divyegala , I was thinking of the explicit cross-validator utilities like cross_val_score. KFold and similar functions should work, as you suggested.

from cuml.datasets import make_regression
from sklearn.model_selection import KFold
import cuml

X, y = make_regression()

clf = cuml.neighbors.KNeighborsRegressor()
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99] TEST: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49] TEST: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99]
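Since KFold itself works, cross_val_score can be approximated with a hand-rolled loop over the fold indices, which keeps the data out of sklearn's input validation entirely. A minimal sketch with a toy estimator (MeanRegressor and manual_cross_val_score are illustrative names, not cuML or sklearn API):

```python
import numpy as np

class MeanRegressor:
    """Toy stand-in for a cuML estimator (illustrative only)."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def score(self, X, y):
        # negative mean squared error, just to produce a score
        return -float(np.mean((np.asarray(y) - self.mean_) ** 2))

def manual_cross_val_score(model, X, y, n_splits=2):
    """Hand-rolled stand-in for cross_val_score: only index arrays are
    produced, so X and y could just as well be GPU objects sliced with
    .iloc / fancy indexing without ever entering sklearn's validation."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate(
            [f for j, f in enumerate(folds) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[valid_idx], y[valid_idx]))
    return scores

X = np.arange(8).reshape(4, 2)
y = np.array([1.0, 1.0, 2.0, 2.0])
print(manual_cross_val_score(MeanRegressor(), X, y))  # → [-1.0, -1.0]
```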
tanaymeh commented 2 years ago

Hi @beckernick, @divyegala Is this something that can be worked on, then? I would love to implement the cross-validators if I have the green light.

divyegala commented 2 years ago

@heytanay we would absolutely welcome a contribution from your side here. Let me know if I can be a resource to you in any way during your PR process, be it with questions about build, code, examples, etc.

tanaymeh commented 2 years ago

@divyegala Thanks! I will start working on it and open a draft PR post-haste. One thing I wanted to clarify: I don't suppose I'll have to write CUDA kernels here, since this looks like something we can do using only Python.

Would love to get your views on it.

divyegala commented 2 years ago

@heytanay yep, I don't foresee any need for CUDA here. You should be able to leverage features from cuML or our dependencies to build this feature out directly in Python.
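For what it's worth, the index-generating part really is pure Python. A rough sketch of what a stratified splitter could look like (a simplified round-robin assignment, not scikit-learn's exact algorithm; the function name is hypothetical):

```python
import numpy as np

def stratified_kfold_indices(y, n_splits):
    """Sketch of a Python-only StratifiedKFold: assign each class's rows
    round-robin to folds so every fold keeps roughly the class balance.
    y is a host array of labels; only index arrays are returned, so the
    feature data itself can stay on the GPU."""
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        fold_of[cls_idx] = np.arange(len(cls_idx)) % n_splits
    for i in range(n_splits):
        valid_idx = np.flatnonzero(fold_of == i)
        train_idx = np.flatnonzero(fold_of != i)
        yield train_idx, valid_idx

for train_idx, valid_idx in stratified_kfold_indices([0, 0, 1, 1], 2):
    print(train_idx, valid_idx)
```

A real implementation would also handle shuffling, group constraints, and classes smaller than n_splits, but none of that requires CUDA.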

tanaymeh commented 2 years ago

@divyegala I've opened a draft PR here. Currently, I have just copied all the necessary functions as-is from the scikit-learn source code into cuml; I will adapt them to cuml and add tests as we go.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

tanaymeh commented 2 years ago

Still working on this.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

tanaymeh commented 2 years ago

Still on this

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

tanaymeh commented 2 years ago

Hi everyone! Really sorry for closing this issue, but I have been caught up in job and research work and won't be able to complete the implementation myself. I have left the PR open in case anyone wants to pick up where I left off and complete the implementation.

beckernick commented 2 years ago

No problem at all! I'm going to reopen the issue, as this is still a valid feature request.

AnVuTrong commented 8 months ago

How is it going? Any progress on this? @beckernick

trivialfis commented 8 months ago

Some related work is happening in https://github.com/rapidsai/cuml/pull/5743 .

ZeroCool2u commented 3 months ago

@trivialfis Really appreciate this work. Would also love to see cross_val_score implemented alongside cross_val_predict if possible.