Describe the bug
When using cudf.pandas and iterating over the dtypes of a dataframe, categorical dtype objects are reported as cudf.CategoricalDtype and not pandas.CategoricalDtype, causing isinstance checks to fail unexpectedly.
Steps/Code to reproduce bug
Save the following as repro.py, run it with python -m cudf.pandas, and compare the result to the output without cudf.pandas:
import pandas as pd
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["A"] = df["A"].astype('category')
print("In for loop: ", [isinstance(t, pd.CategoricalDtype) for t in df.dtypes][0])
print("With iloc: ", isinstance(df.dtypes.iloc[0], pd.CategoricalDtype))
$ python repro.py
In for loop: True
With iloc: True
$ python -m cudf.pandas repro.py
In for loop: False
With iloc: True
Expected behavior
Output should be the same for the isinstance checks with and without cudf.pandas and regardless of whether or not we are iterating over dtypes or selecting them by index.
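The expected behavior can be written as a small self-check. This is only a sketch: the helper name `categorical_checks` is hypothetical, and the third access path (`df[column].dtype`) is an extra comparison point that is not part of the original repro, so its result under cudf.pandas is not known. The script should pass with plain pandas. Based on the output above, the final assertion would presumably fail under `python -m cudf.pandas`.

```python
import pandas as pd


def categorical_checks(df, column):
    """Return the isinstance result for each way of reaching the column's dtype."""
    pos = df.columns.get_loc(column)  # integer position of the column
    return {
        # Path 1: iterate over df.dtypes (reported as False under cudf.pandas)
        "iteration": [isinstance(t, pd.CategoricalDtype) for t in df.dtypes][pos],
        # Path 2: positional lookup on df.dtypes (reported as True in both modes)
        "iloc": isinstance(df.dtypes.iloc[pos], pd.CategoricalDtype),
        # Path 3 (assumption, not in the original report): dtype of the Series
        "column": isinstance(df[column].dtype, pd.CategoricalDtype),
    }


df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["A"] = df["A"].astype("category")

results = categorical_checks(df, "A")
print(results)

# Expected behavior: every access path gives the same answer.
assert len(set(results.values())) == 1, results
```

Running the same file with and without `python -m cudf.pandas` shows which access path diverges, which may help narrow down where the unexpected dtype object is returned.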
Environment details (please complete the following information):
Environment location: GCP g2-standard-8 instance
Linux Distro/Architecture: Debian 11 Bullseye amd64
GPU Model/Driver: L4 / 550.90.07
CUDA: 12.4
Method of cuDF & cuML install: conda (RAPIDS 24.10)
Additional context
This prevents training an XGBoost model on categorical variables using cudf.pandas if the .plot method of a Series has been called beforehand. See separate issue for information on unexpected behavior from .plot.