Open kshitizgupta21 opened 2 years ago
Thanks for raising this issue. This is likely a documentation error, as this functionality is currently designed for numeric data in cuML.
If you share a reproducible example, someone may be able to advise on a workaround.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
A reproducible example can be taken from the cuml.compose.make_column_selector() example in the API reference guide:
https://docs.rapids.ai/api/cuml/stable/api.html#text-preprocessing-single-gpu.
The original example is:
from cuml.preprocessing import StandardScaler, OneHotEncoder
from cuml.preprocessing import make_column_transformer
from cuml.preprocessing import make_column_selector
import cupy as cp
import cudf
X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=cp.number)),  # rating
    (OneHotEncoder(),
     make_column_selector(dtype_include=object)))  # city
ct.fit_transform(X)
Changing the example to the following, with a missing value in the city column and a SimpleImputer step added:
import numpy as np
from cuml.preprocessing import SimpleImputer
X = cudf.DataFrame({'city': ['London', np.nan, 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=cp.number)),  # rating
    (SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Other'),
     make_column_selector(dtype_include=object)),  # city (imputation)
    (OneHotEncoder(),
     make_column_selector(dtype_include=object)))  # city
ct.fit_transform(X)
This gives the error message shown by kshitizgupta21 above.
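Until SimpleImputer handles string columns, one possible workaround is to fill the missing strings with ordinary DataFrame methods before the data reaches the column transformer. The following is only a minimal sketch under stated assumptions: `fill_missing_strings` is a hypothetical helper name, not part of cuML or cudf, and the pandas fallback is only there so the sketch can be tried on a machine without a GPU.

```python
# Assumption: cudf is used when available; pandas is only a CPU stand-in
# so that the sketch can be tried without a GPU.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def fill_missing_strings(df, columns, strategy="constant", fill_value="Other"):
    """Hypothetical helper: fill nulls in string columns via fillna()."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        if strategy == "constant":
            value = fill_value
        elif strategy == "most_frequent":
            # mode() ignores nulls; take the first reported mode
            value = out[col].mode().iloc[0]
        else:
            raise ValueError(f"unsupported strategy: {strategy}")
        out[col] = out[col].fillna(value)
    return out


X = xdf.DataFrame({"city": ["London", None, "Paris", "Sallisaw"],
                   "rating": [5, 3, 4, 5]})
X_filled = fill_missing_strings(X, ["city"])
```

`X_filled` could then be passed to the original two-step column transformer (StandardScaler plus OneHotEncoder). Filling up front also sidesteps a separate concern with the modified example: the transformers in a column transformer each operate on the input columns independently, so an imputer listed alongside OneHotEncoder would not feed its output into the encoder.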
As an aside, I also get the same error message when I add the remainder='passthrough' argument to the original example, i.e.:
X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
                    'rating': [5, 3, 4, 5]})
ct = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=cp.number)),  # rating
    (OneHotEncoder(),
     make_column_selector(dtype_include=object)),  # city
    remainder='passthrough')
ct.fit_transform(X)
I'm using a Docker container, converted into an AWS SageMaker Studio kernel. The Docker image is:
nvcr.io/nvidia/rapidsai/rapidsai:22.08-cuda11.5-runtime-ubuntu20.04-py3.8
Thanks for providing a reproducible example, @LiamMadigan-EN0107! We'll evaluate the feasibility of including string support with the constant and most_frequent strategies.
In the short term, would you be interested in contributing a PR to update the documentation to indicate strings are not yet supported?
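For readers unfamiliar with the two strategies, the sketch below illustrates what they would be expected to compute for a string column. It is plain Python for illustration only, not cuML's implementation. The function name `impute_strings` is hypothetical, missing values are represented as None, and ties in most_frequent are broken by taking the smallest value, which mirrors scikit-learn's documented behaviour and is an assumption as far as cuML is concerned.

```python
from collections import Counter


def impute_strings(values, strategy, fill_value=None):
    """Illustrative only: fill None entries in a list of strings."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        # Every missing entry receives the same user-supplied value.
        replacement = fill_value
    elif strategy == "most_frequent":
        # Assumes at least one non-missing value is present.
        counts = Counter(present)
        top = max(counts.values())
        # Tie-break: smallest of the most frequent values.
        replacement = min(v for v, c in counts.items() if c == top)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [replacement if v is None else v for v in values]


cities = ["London", None, "Paris", "London"]
constant_filled = impute_strings(cities, "constant", fill_value="Other")
frequent_filled = impute_strings(cities, "most_frequent")
```

Here `constant_filled` replaces the missing city with 'Other', while `frequent_filled` replaces it with 'London', the value that occurs most often.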
I also got the same error.
TypeError: String Arrays is not yet implemented in cudf
Is your feature request related to a problem? Please describe.
I'm trying to use cuml preprocessing's SimpleImputer to impute string columns in v22.04. The docs mention that both constant and most_frequent strategies are supported for string columns, but when I try to use them I get this error:
TypeError: String Arrays is not yet implemented in cudf
Here is the complete output
Describe the solution you'd like
String column imputation to go smoothly