Closed wphicks closed 3 weeks ago
I think the problem is due to our custom `__iter__` function in our `pd.Series` proxy type. The loop `for t in df.dtypes` calls `__iter__`, which (for our proxy type) always uses the underlying slow object's `__iter__` method. I'm not sure why we're using a custom iterator for `pd.Series`; maybe we shouldn't?
Okay, removing the custom iterator made your minimal repro work, but it could break other things (we'll see).
In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
...:
...: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
...: df["A"] = df["A"].astype('category')
...:
...: print("In for loop: ", [isinstance(t, pd.CategoricalDtype) for t in df.dtypes][0])
...: print("With iloc: ", isinstance(df.dtypes.iloc[0], pd.CategoricalDtype))
In for loop: True
With iloc: True
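
As a rough illustration of why iterating and indexing can disagree, here is a toy model in plain Python. It is not the cudf.pandas implementation: `Raw`, `Wrapped`, `ProxyWithCustomIter`, and `ProxyWithoutCustomIter` are hypothetical names invented for this sketch. It only assumes that indexing goes through a path that wraps results, while a custom `__iter__` hands out elements of the underlying object directly.

```python
class Raw:
    """Hypothetical stand-in for an element of the underlying object."""

    def __init__(self, name):
        self.name = name


class Wrapped:
    """Hypothetical stand-in for the type that user code checks with isinstance."""

    def __init__(self, raw):
        self.raw = raw


class ProxyWithCustomIter:
    """Toy proxy: indexing wraps results, but the custom __iter__ does not."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, i):
        # Indexing path: every result is wrapped before being returned.
        return Wrapped(self._items[i])

    def __iter__(self):
        # Custom iterator: delegates straight to the underlying container,
        # so callers see unwrapped elements.
        return iter(self._items)


class ProxyWithoutCustomIter:
    """Same toy proxy with the custom __iter__ removed."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, i):
        # With no __iter__ defined, Python falls back to calling
        # __getitem__(0), __getitem__(1), ... until IndexError is raised.
        return Wrapped(self._items[i])


items = [Raw("category")]
broken = ProxyWithCustomIter(items)
fixed = ProxyWithoutCustomIter(items)

iter_broken = [isinstance(t, Wrapped) for t in broken]
index_broken = isinstance(broken[0], Wrapped)
iter_fixed = [isinstance(t, Wrapped) for t in fixed]

print("custom __iter__, for loop:", iter_broken[0])
print("custom __iter__, indexing:", index_broken)
print("no custom __iter__, for loop:", iter_fixed[0])
```

In this toy, dropping `__iter__` makes iteration reuse the indexing path, so both access styles agree. Whether the real proxy type behaves the same way after the change depends on how its attribute dispatch is implemented, which this sketch does not attempt to model.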
Describe the bug
When using cudf.pandas and iterating over the dtypes of a dataframe, categorical dtype objects are reported as `cudf.CategoricalDtype` and not `pandas.CategoricalDtype`, causing `isinstance` checks to fail unexpectedly.

Steps/Code to reproduce bug
Run the following using `python -m cudf.pandas` and compare to output without `cudf.pandas`.

Expected behavior
Output should be the same for the `isinstance` checks with and without `cudf.pandas`, regardless of whether we are iterating over dtypes or selecting them by index.

Environment details (please complete the following information):
conda list
Output:

Additional context
This prevents training an XGBoost model on categorical variables using `cudf.pandas` if the `.plot` method of a `Series` has been called beforehand. See #17166 for information on unexpected behavior from `.plot`.