rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] train_test_split with stratify cuml dataframe #4705

Open nyoungstudios opened 2 years ago

nyoungstudios commented 2 years ago

Describe the bug: cuml.model_selection.train_test_split does not behave like sklearn.model_selection.train_test_split when a DataFrame is passed to stratify. sklearn accepts a single-column DataFrame there, while cuml raises an exception.

Steps/Code to reproduce bug

import sklearn.model_selection
import pandas as pd

items = [
    {"value": 0, "name": "good"},
    {"value": 1, "name": "good"},
    {"value": 2, "name": "good"},
    {"value": 3, "name": "good"},
    {"value": 4, "name": "good"},
    {"value": 5, "name": "bad"},
    {"value": 6, "name": "bad"},
    {"value": 7, "name": "bad"},
    {"value": 8, "name": "bad"},
    {"value": 9, "name": "bad"},
]

df = pd.DataFrame(items)

# works properly passing the dataframe to stratify on
train, test = sklearn.model_selection.train_test_split(df, test_size=0.5, stratify=df[["name"]])

import cudf
import cuml.model_selection

df2 = cudf.DataFrame(items)

# raises exception
train, test = cuml.model_selection.train_test_split(df2, test_size=0.5, stratify=df2[["name"]])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_131662/3302210619.py in <module>
----> 1 train, test = cuml.model_selection.train_test_split(df2, test_size=0.5, stratify=df2[["name"]])

/lib/python3.7/site-packages/cuml/model_selection/_split.py in train_test_split(X, y, test_size, train_size, shuffle, random_state, stratify)
    438                                            x_numba,
    439                                            y_numba,
--> 440                                            random_state)
    441             return split_return
    442 

/lib/python3.7/site-packages/cuml/model_selection/_split.py in _stratify_split(X, stratify, labels, n_train, n_test, x_numba, y_numba, random_state)
     62     elif isinstance(stratify, cudf.DataFrame):
     63         # ensuring it has just one column
---> 64         if labels.shape[1] != 1:
     65             raise ValueError('Expected one column for labels, but found df'
     66                              'with shape = %d' % (labels.shape))

AttributeError: 'NoneType' object has no attribute 'shape'

# this also works well
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(df[["value"]], df[["name"]], test_size=0.5, stratify=df[["name"]])

# but this returns an exception as well
X_train, X_test, y_train, y_test = cuml.model_selection.train_test_split(df2[["value"]], df2[["name"]], test_size=0.5, stratify=df2[["name"]])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_131662/1980875442.py in <module>
----> 1 X_train, X_test, y_train, y_test = cuml.model_selection.train_test_split(df2[["value"]], df2[["name"]], test_size=0.5, stratify=df2[["name"]])

/lib/python3.7/site-packages/cuml/model_selection/_split.py in train_test_split(X, y, test_size, train_size, shuffle, random_state, stratify)
    438                                            x_numba,
    439                                            y_numba,
--> 440                                            random_state)
    441             return split_return
    442 

/lib/python3.7/site-packages/cuml/model_selection/_split.py in _stratify_split(X, stratify, labels, n_train, n_test, x_numba, y_numba, random_state)
     66                              'with shape = %d' % (labels.shape))
     67         labels_cudf = True
---> 68         labels = labels[0].values
     69 
     70     labels_order = _strides_to_order(

/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/lib/python3.7/site-packages/cudf/core/dataframe.py in __getitem__(self, arg)
    681         """
    682         if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
--> 683             return self._get_columns_by_label(arg, downcast=True)
    684 
    685         elif isinstance(arg, slice):

/lib/python3.7/site-packages/cudf/core/dataframe.py in _get_columns_by_label(self, labels, downcast)
   1574         If downcast is True, try and downcast from a DataFrame to a Series
   1575         """
-> 1576         new_data = super()._get_columns_by_label(labels, downcast)
   1577         if downcast:
   1578             if is_scalar(labels):

/lib/python3.7/site-packages/cudf/core/frame.py in _get_columns_by_label(self, labels, downcast)
    524 
    525         """
--> 526         return self._data.select_by_label(labels)
    527 
    528     def _get_columns_by_index(self, indices):

/lib/python3.7/site-packages/cudf/core/column_accessor.py in select_by_label(self, key)
    344                 if any(isinstance(k, slice) for k in key):
    345                     return self._select_by_label_with_wildcard(key)
--> 346             return self._select_by_label_grouped(key)
    347 
    348     def select_by_index(self, index: Any) -> ColumnAccessor:

/lib/python3.7/site-packages/cudf/core/column_accessor.py in _select_by_label_grouped(self, key)
    406 
    407     def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 408         result = self._grouped_data[key]
    409         if isinstance(result, cudf.core.column.ColumnBase):
    410             return self.__class__({key: result})

KeyError: 0
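
Reading the two tracebacks together, one plausible explanation (an assumption drawn only from the frames shown above, not checked against the rest of the cuML source) is that the cudf.DataFrame branch of _stratify_split inspects labels, which is None when no y is passed, and then selects the column by the label 0, which does not exist when the only column is called "name". The sketch below reproduces both failure modes with pandas as a CPU stand-in for cudf, and shows positional selection with iloc, which does not depend on the column name. The helper first_column_values is hypothetical and is not part of cuML.

```python
import pandas as pd

# pandas stands in for cudf here so the sketch runs without a GPU.
stratify = pd.DataFrame({"name": ["good"] * 5 + ["bad"] * 5})

# Failure 1: no y was passed, so the labels argument is None.
labels = None
try:
    labels.shape[1]
except AttributeError as exc:
    first_error = str(exc)

# Failure 2: selecting by the label 0 fails because the only column is "name".
try:
    stratify[0]
except KeyError as exc:
    second_error = type(exc).__name__


def first_column_values(frame):
    """Hypothetical helper: return the single column by position, not by label."""
    if frame.shape[1] != 1:
        raise ValueError(
            "Expected one column for labels, but found shape %s" % (frame.shape,)
        )
    return frame.iloc[:, 0]


column = first_column_values(stratify)
print(first_error)
print(second_error)
print(column.tolist())
```

Selecting by position keeps the single-column check from the traceback but no longer assumes the column is named 0.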

Expected behavior: cuml.model_selection.train_test_split should not raise an error and should stratify the split on the given DataFrame column, as sklearn does.
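
To make "properly stratify" concrete, one option is to compare per-class counts in a split against the full data. The following is a minimal sketch that uses plain Python lists rather than DataFrames. The helper is_stratified is hypothetical, and its tolerance of one row per class is an assumption chosen for small inputs like the ten-row example above; it is not sklearn's or cuML's definition of stratification.

```python
from collections import Counter


def is_stratified(all_labels, split_labels, tolerance=1):
    """Hypothetical check: each class appears in the split in roughly the
    same proportion as in the full data, within `tolerance` rows."""
    total = Counter(all_labels)
    part = Counter(split_labels)
    fraction = len(split_labels) / len(all_labels)
    return all(
        abs(part.get(label, 0) - count * fraction) <= tolerance
        for label, count in total.items()
    )


labels = ["good"] * 5 + ["bad"] * 5

# A five-row split with 3 "good" and 2 "bad" keeps both classes represented.
print(is_stratified(labels, ["good", "good", "good", "bad", "bad"]))  # True
# A split that contains only one class does not.
print(is_stratified(labels, ["good"] * 5))  # False
```

A check like this could be applied to the "name" column of the train and test outputs once the call succeeds.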


github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.