Open tanaymeh opened 2 years ago
@heytanay you should be able to use scikit-learn's cross validators directly with a cuML model. Can you try that and let me know if it works?
Hi, if I pass a cuDF DataFrame to a scikit-learn cross-validator, I get an "Implicit conversion to a host NumPy array" error. The following snippet reproduces it:
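import cudf
from sklearn.model_selection import StratifiedKFold

df = cudf.read_csv("train.csv")
X = df.drop(['target'], axis=1)
kfold = StratifiedKFold(n_splits=5)
# Raises the TypeError shown below: scikit-learn tries to convert the cuDF objects to NumPy.
for train_idx, valid_idx in kfold.split(X=X, y=df['target']):
    print(train_idx, valid_idx)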
Error:
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()
To get around this, I believe I would have to convert the cuDF DataFrame to a regular pandas DataFrame, which would be impractical for large tabular files.
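A possible partial workaround, sketched under assumptions rather than confirmed in this thread: scikit-learn documents that StratifiedKFold needs only the labels to generate splits, so a placeholder such as np.zeros(n_samples) can stand in for X. That would mean only the target column has to be copied to the host. In the sketch below, the NumPy array y_host is a stand-in for such a host copy; how it is obtained from cuDF depends on the cuDF version and is not shown.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in for a host copy of the target column (assumption: in the
# cuDF case this would come from an explicit device-to-host copy).
y_host = np.array([0, 1] * 10)

# Only the number of samples and the labels matter to StratifiedKFold,
# so a placeholder of the same length can stand in for X.
placeholder_X = np.zeros(len(y_host))

kfold = StratifiedKFold(n_splits=5)
splits = list(kfold.split(placeholder_X, y_host))

for train_idx, valid_idx in splits:
    # These integer indices could then be used to select rows on the GPU.
    print(train_idx, valid_idx)
```

This only covers index generation; it does not address utilities such as cross_val_score that validate the feature matrix itself.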
Thanks for filing this feature request. This is a limitation of the current approach.
Often when the data transfer time is non-trivial the estimator training time is the bulk of the time spent. E.g., time spent training CPU model on 3 GB dataset >> time spent transferring 3 GB dataset, enabling cuML to still provide large speedups. Are you in a scenario in which this isn't the case? Would you be able to share a bit more information?
@beckernick correct me if I am wrong, but shouldn't we be able to get away by using CuPy arrays, since they have the same mechanisms as NumPy arrays?
@beckernick I don't have a specific dataset in mind, but there have been a few instances, on Kaggle as well as off Kaggle, where I was dealing with really large datasets or where converting the data to pandas DataFrames would not have been practical.
That's the reason I proposed implementing these cross-validators.
@divyegala CuPy arrays will fail the internal _validate_data checks in scikit-learn that ultimately call down to np.asarray and run into the equivalent implicit conversion error. There's been some recent discussion about paths forward, though.
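The mechanism described above can be illustrated without a GPU. np.asarray consults an object's __array__ method, and if that method raises, the error propagates to the caller. DeviceArrayLike below is a hypothetical stand-in, not a CuPy or cuDF class; it only mimics a device array that refuses implicit host copies.

```python
import numpy as np


class DeviceArrayLike:
    """Hypothetical stand-in for a GPU array that forbids implicit host copies."""

    def __array__(self, dtype=None, copy=None):
        raise TypeError("Implicit conversion to a host NumPy array is not allowed.")


def try_host_conversion(obj):
    """Return the message of the TypeError raised by np.asarray, or None on success."""
    try:
        np.asarray(obj)
    except TypeError as exc:
        return str(exc)
    return None


message = try_host_conversion(DeviceArrayLike())
print(message)
```

Any validation path that ends in np.asarray would hit the same error, which is consistent with the behavior reported for both cuDF and CuPy inputs in this thread.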
@heytanay thanks for the additional context. I agree that using GPU data structures in cross validators is a reasonable feature request. I just wanted to provide some color on the impact of training time vs. transfer time.
@divyegala, I was thinking of the explicit cross-validator utilities like cross_val_score. KFold and similar functions should work, as you suggested.
from cuml.datasets import make_regression
from sklearn.model_selection import KFold
import cuml

X, y = make_regression()
clf = cuml.neighbors.KNeighborsRegressor()
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
98 99] TEST: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49] TEST: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
98 99]
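Since KFold yields plain index arrays, a scoring loop can be written by hand where cross_val_score is not usable. The following is a minimal sketch, not something taken from cuML: manual_cross_val_score and estimator_factory are hypothetical names, and scikit-learn's KNeighborsRegressor with NumPy data is used as a CPU stand-in so the example runs anywhere. With GPU arrays, whether the index arrays must first be moved to the device is not confirmed here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor  # CPU stand-in for the cuML estimator


def manual_cross_val_score(estimator_factory, X, y, n_splits=2):
    """Fit a fresh estimator per fold and return one R^2 score per fold."""
    scores = []
    for train_index, test_index in KFold(n_splits=n_splits).split(X):
        model = estimator_factory()
        model.fit(X[train_index], y[train_index])
        pred = model.predict(X[test_index])
        # R^2 computed by hand so no scorer needs to validate the inputs.
        ss_res = ((y[test_index] - pred) ** 2).sum()
        ss_tot = ((y[test_index] - y[test_index].mean()) ** 2).sum()
        scores.append(float(1.0 - ss_res / ss_tot))
    return scores


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

scores = manual_cross_val_score(KNeighborsRegressor, X_demo, y_demo, n_splits=2)
print(scores)
```

Creating a new estimator for every fold keeps the folds independent, which is the same guarantee cross_val_score gives by cloning the estimator.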
Hi @beckernick, @divyegala. Is this something that can be worked on, then? I would love to implement the cross-validators if I have the green light.
@heytanay we would absolutely welcome a contribution from your side here. Let me know if I can be a resource to you in any way during your PR process, be it with questions about build, code, examples, etc.
@divyegala Thanks! I will start working on it and open a draft PR post-haste. I wanted to clear up one doubt: I don't suppose I'll have to write CUDA kernels here, since this looks like something we can do using only Python.
Would love to get your views on it.
@heytanay yep, I don't foresee any need of CUDA here. You should be able to leverage features from cuML or our dependencies to directly build this feature out in Python.
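To illustrate why no CUDA kernels should be needed, here is a minimal sketch of unshuffled k-fold index generation written against an array module. It is not the code from the draft PR; kfold_indices and its xp parameter are hypothetical. NumPy is used by default, and passing CuPy as xp is an assumption based on CuPy offering arange and concatenate with matching semantics.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5, xp=np):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = xp.arange(n_samples)
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = xp.concatenate([indices[:start], indices[stop:]])
        yield train_idx, test_idx
        start = stop


folds = list(kfold_indices(10, n_splits=3))
for train_idx, test_idx in folds:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Stratified and grouped variants need per-class or per-group bookkeeping on top of this, but they can likewise be expressed with array operations alone.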
@divyegala I've opened a draft PR here. For now, I have just copied all the necessary functions as-is from the scikit-learn source code into cuML; I will adapt them to cuML and add tests as we go.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Still working on this.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Still on this
Hi everyone! Really sorry for closing this issue, but I have been caught up in my job and research work and won't be able to complete the implementation myself. I have left the PR open in case anyone wants to pick up where I left off and complete the implementation.
No problem at all! I'm going to reopen the issue, as this is still a valid feature request.
How is it going? Any progress on this? @beckernick
Some related work is happening in https://github.com/rapidsai/cuml/pull/5743 .
@trivialfis Really appreciate this work. Would also love to see cross_val_score implemented alongside cross_val_predict if possible.
Is your feature request related to a problem? Please describe. I would really love to see cross-validators such as KFold, StratifiedKFold, GroupKFold, etc. in cuML. It would help make RAPIDS data science pipelines more independent of scikit-learn.
Describe the solution you'd like I would like to add the following in the first iteration (since there are many cross-validators in scikit-learn):
Describe alternatives you've considered I am currently not aware of any alternatives for using the above cross-validators natively in cuml.