[FEA] Add support for strings in cuml.preprocessing.SimpleImputer #4786

Open kshitizgupta21 opened 2 years ago

kshitizgupta21 commented 2 years ago

Is your feature request related to a problem? Please describe. I'm trying to use cuML preprocessing's SimpleImputer to impute string columns in v22.04. The docs mention that both the constant and most_frequent strategies are supported for string columns, but when I try to use them I get this error:

TypeError: String Arrays is not yet implemented in cudf

Here is the complete output

from cuml.preprocessing import SimpleImputer
# Merchant State and Zip are type object columns
string_cols = ["Merchant State", "Zip"]

for col in string_cols:
    imputer = SimpleImputer(strategy="most_frequent")
    X_train[[col]] = imputer.fit_transform(X_train[[col]])
    X_test[[col]] = imputer.transform(X_test[[col]])

for col in string_cols:
    imputer = SimpleImputer(strategy="constant", fill_value='UNKNOWN')
    X_train[[col]] = imputer.fit_transform(X_train[[col]])
    X_test[[col]] = imputer.transform(X_test[[col]])

Describe the solution you'd like. String column imputation with the constant and most_frequent strategies should work without raising this error.

beckernick commented 2 years ago

Thanks for raising this issue. This is likely a documentation error, as this functionality is currently designed for numeric data in cuML.

If you share a reproducible example, someone may be able to advise on a workaround.

LiamMadigan-EN0107 commented 2 years ago

A reproducible example can be taken from the cuml.compose.make_column_selector() example in the API reference guide:


The original example is:

from cuml.preprocessing import StandardScaler, OneHotEncoder
from cuml.preprocessing import make_column_transformer
from cuml.preprocessing import make_column_selector
import cupy as cp
import cudf

X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=cp.number)),  # rating
    (OneHotEncoder(), make_column_selector(dtype_include=object)))  # city
ct.fit_transform(X)

By changing the example to:

# additional imports needed by the modified example
import numpy as np
from cuml.preprocessing import SimpleImputer

X = cudf.DataFrame({'city': ['London', np.nan, 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=cp.number)),  # rating
    (SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Other'),
     make_column_selector(dtype_include=object)),  # city
    (OneHotEncoder(), make_column_selector(dtype_include=object)))  # city
ct.fit_transform(X)

This gives the error message shown by kshitizgupta21 above.

As an aside, I also get the same error message when I add the remainder='passthrough' argument to the original example, i.e.:

X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=cp.number)),  # rating
    (OneHotEncoder(), make_column_selector(dtype_include=object)),  # city
    remainder='passthrough')
ct.fit_transform(X)

I'm using a Docker container converted into an AWS SageMaker Studio kernel. The Docker version is:


beckernick commented 2 years ago

Thanks for providing a reproducible example, @LiamMadigan-EN0107! We'll evaluate the feasibility of including string support with the constant and most_frequent strategies.

In the short term, would you be interested in contributing a PR to update the documentation to indicate strings are not yet supported?
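Until string support is available, one possible workaround is to skip SimpleImputer for string columns and fill them directly on the DataFrame. The sketch below is not part of cuML: the helpers `fit_fill_values` and `apply_fill_values` are hypothetical names, the sample data is made up, and it assumes that cuDF's `Series.mode()` and `Series.fillna()` behave like their pandas counterparts for string columns. It falls back to pandas when cuDF is not installed so it can be tried on a machine without a GPU.

```python
# Workaround sketch: impute string columns without SimpleImputer.
# Assumption: cudf mirrors the pandas API for mode() and fillna() on strings.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # CPU stand-in with the same DataFrame API


def fit_fill_values(df, columns, strategy="most_frequent", fill_value=None):
    """Learn one fill value per column from the training frame (hypothetical helper)."""
    if strategy == "constant":
        return {col: fill_value for col in columns}
    if strategy == "most_frequent":
        # mode() skips missing values by default; take the first entry on ties.
        return {col: df[col].mode().iloc[0] for col in columns}
    raise ValueError(f"unsupported strategy: {strategy!r}")


def apply_fill_values(df, fill_values):
    """Return a copy of df with missing values replaced per column."""
    out = df.copy()
    for col, value in fill_values.items():
        out[col] = out[col].fillna(value)
    return out


# Made-up sample data using the column names from the original report.
X_train = xdf.DataFrame({
    "Merchant State": ["CA", None, "CA", "NY"],
    "Zip": ["94016", "10001", None, "10001"],
})
X_test = xdf.DataFrame({
    "Merchant State": [None, "TX"],
    "Zip": [None, "73301"],
})

# Fit on the training data only, then reuse the learned values on both frames.
fills = fit_fill_values(X_train, ["Merchant State", "Zip"])
X_train_filled = apply_fill_values(X_train, fills)
X_test_filled = apply_fill_values(X_test, fills)
```

For the constant strategy, call `fit_fill_values(X_train, cols, strategy="constant", fill_value="UNKNOWN")` instead. A column that is entirely missing has no mode, so the `most_frequent` branch would raise an `IndexError` there and needs separate handling.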

jacklinsibiyal commented 4 weeks ago

I also got the same error.

TypeError: String Arrays is not yet implemented in cudf