rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Handling non-numerical dtype Series as inputs #4169

Open sarahyurick opened 3 years ago

sarahyurick commented 3 years ago

In scikit-learn, I am able to do the following (using the Iris dataset, downloaded as a CSV file):

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("./iris.csv")
X_train = df.drop(['species'], axis=1)
y_train = df['species']
model = LogisticRegression()
model.fit(X_train, y_train)

However, when I try the same in cuML, I get an error:

import cudf
from cuml.linear_model import LogisticRegression

df = cudf.read_csv("./iris.csv")
X_train = df.drop(['species'], axis=1)
y_train = df['species']
model = LogisticRegression()
model.fit(X_train, y_train)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_7485/2582582848.py in <module>
      2 
      3 model = LogisticRegression()
----> 4 model.fit(X_train, y_train)

~/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
    407                                 target_val=target_val)
    408 
--> 409                 return func(*args, **kwargs)
    410 
    411         @wraps(func)

cuml/linear_model/logistic_regression.pyx in cuml.linear_model.logistic_regression.LogisticRegression.fit()

~/miniconda3/envs/rapids-21.08/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

~/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner(*args, **kwargs)
    358         def inner(*args, **kwargs):
    359             with self._recreate_cm(func, args):
--> 360                 return func(*args, **kwargs)
    361 
    362         return inner

~/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cuml/common/input_utils.py in input_to_cuml_array(X, order, deepcopy, check_dtype, convert_to_dtype, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)
    338 
    339     elif hasattr(X, "__array_interface__") or \
--> 340             hasattr(X, "__cuda_array_interface__"):
    341 
    342         host_array = hasattr(X, "__array_interface__")

~/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/frame.py in __cuda_array_interface__(self)
   3867     @property
   3868     def __cuda_array_interface__(self):
-> 3869         return self._column.__cuda_array_interface__
   3870 
   3871     def factorize(self, na_sentinel=-1):

~/miniconda3/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/column.py in __cuda_array_interface__(self)
   1028     @property
   1029     def __cuda_array_interface__(self):
-> 1030         raise NotImplementedError(
   1031             f"dtype {self.dtype} is not yet supported via "
   1032             "`__cuda_array_interface__`"

NotImplementedError: dtype object is not yet supported via `__cuda_array_interface__`

After discussion with @dantegd and @galipremsagar, we determined that this happens for any Series with a str or categorical dtype. Such a Series reaches the conditional hasattr(X, "__array_interface__") or hasattr(X, "__cuda_array_interface__") in the input_to_cuml_array() function in cuml/common/input_utils.py, and evaluating __cuda_array_interface__ on it raises the NotImplementedError shown above. We therefore need to make sure that the dtype is numerical before this check runs.
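
The traceback stops on the hasattr line because Python's hasattr() only suppresses AttributeError; any other exception raised while a property is evaluated propagates to the caller. The minimal sketch below reproduces that behaviour without cuDF. FakeColumn and check_numeric are hypothetical names used only for illustration, and the guard is one possible shape for a dtype check, not cuML's actual fix.

```python
class FakeColumn:
    """Hypothetical stand-in for a cuDF column; not part of cuDF or cuML."""

    def __init__(self, dtype):
        self.dtype = dtype  # a plain string here, e.g. "object" or "float64"

    @property
    def __cuda_array_interface__(self):
        # Mirrors the behaviour in the traceback: non-numerical dtypes raise.
        if self.dtype in ("object", "category"):
            raise NotImplementedError(
                f"dtype {self.dtype} is not yet supported via "
                "`__cuda_array_interface__`"
            )
        return {"typestr": self.dtype}


def hasattr_propagates(obj):
    """Return True if hasattr() lets a non-AttributeError exception escape."""
    try:
        hasattr(obj, "__cuda_array_interface__")
    except NotImplementedError:
        return True
    return False


def check_numeric(obj):
    """Hypothetical guard: fail early with a clearer message."""
    if obj.dtype in ("object", "category"):
        raise TypeError(
            f"expected a numerical dtype, got {obj.dtype!r}; "
            "encode string or categorical labels first"
        )
    return obj


strings = FakeColumn("object")
floats = FakeColumn("float64")
print(hasattr_propagates(strings))  # True: NotImplementedError escapes hasattr()
print(hasattr_propagates(floats))   # False: the property returns normally
```

Running a check like check_numeric before the hasattr test would turn the NotImplementedError into an error message that tells the user what to change.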

For further reference, the Series that is causing the error looks like:

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object
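
Until fit() handles non-numerical targets directly, one possible workaround is to encode the labels as integers first. The pure-Python sketch below shows the idea on a few sample values; encode_labels is a hypothetical helper, not a cuML function. On the GPU, cudf.Series.factorize() or cuml.preprocessing.LabelEncoder should play the same role, though neither has been tested against this issue here.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in order of first appearance.

    Hypothetical helper for illustration; returns (codes, classes).
    """
    classes = []  # code -> original label
    index = {}    # original label -> code
    codes = []
    for label in labels:
        if label not in index:
            index[label] = len(classes)
            classes.append(label)
        codes.append(index[label])
    return codes, classes


species = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
codes, classes = encode_labels(species)
print(codes)    # [0, 0, 1, 2, 2]
print(classes)  # ['setosa', 'versicolor', 'virginica']

# Decode integer predictions back to the original strings.
decoded = [classes[c] for c in codes]
```

The integer codes can be passed as y_train, and the classes list maps predictions back to the species names.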

beckernick commented 1 year ago

A user ran into this issue today while trying to use Random Forest.