rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Interchange `Column.describe_categorical` is a tuple, not a dict #11332

Open honno opened 2 years ago

honno commented 2 years ago

In the dataframe interchange protocol, `Column.describe_categorical` should return a dict, but cuDF returns a tuple:

>>> df = cudf.DataFrame({"foo": cudf.Series([0, 1], dtype="category")})
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.describe_categorical
(False, True, {0: 0, 1: 1})
>>> type(interchange_col.describe_categorical)
tuple  # should be dict

The relevant code, which returns a tuple instead of a dict:

https://github.com/rapidsai/cudf/blob/edc5062bdcc3e12755603b0ad07a4d271fe95261/python/cudf/cudf/core/df_protocol.py#L295
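For illustration, here is a minimal sketch of the shape the protocol expects, built from the tuple shown above. It is not the cuDF fix itself. The helper name `describe_categorical_as_dict` is hypothetical, and the tuple order `(is_ordered, is_dictionary, mapping)` is an assumption inferred from the output `(False, True, {0: 0, 1: 1})`. The keys `"is_dictionary"` and `"mapping"` are the ones the pandas traceback below indexes; `"is_ordered"` is assumed from the protocol spec.

```python
from typing import Any, Dict, Tuple


def describe_categorical_as_dict(
    described: Tuple[bool, bool, Dict[int, Any]]
) -> Dict[str, Any]:
    """Convert a (is_ordered, is_dictionary, mapping) tuple into a dict.

    The tuple order is an assumption based on the output in this issue.
    """
    is_ordered, is_dictionary, mapping = described
    return {
        "is_ordered": is_ordered,
        "is_dictionary": is_dictionary,
        "mapping": mapping,
    }


# The tuple cuDF returned in the report above.
result = describe_categorical_as_dict((False, True, {0: 0, 1: 1}))
print(result["is_dictionary"], result["mapping"])
```

With a dict like this, string lookups such as `categorical["is_dictionary"]` no longer raise `TypeError`.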

This prevents interchanging dataframes that have categorical columns with other libraries, e.g. pandas:

>>> from pandas.api.exchange import from_dataframe
>>> from_dataframe(df)
.../pandas/core/exchange/from_dataframe.py:184, in categorical_column_to_series(col)
    169 """
    170 Convert a column holding categorical data to a pandas Series.
    171 
   (...)
    180     that keeps the memory alive.
    181 """
    182 categorical = col.describe_categorical
--> 184 if not categorical["is_dictionary"]:
    185     raise NotImplementedError("Non-dictionary categoricals not supported yet")
    187 mapping = categorical["mapping"]
TypeError: tuple indices must be integers or slices, not str

pandas and modin are compliant here. Interestingly, vaex currently returns a tuple too: https://github.com/vaexio/vaex/issues/2113
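Since more than one library has hit this, a small library-agnostic check might help catch it in tests. The sketch below is only a rough idea: `describe_categorical_problems` and `EXPECTED_KEYS` are hypothetical names, and the key list is an assumption (`"is_dictionary"` and `"mapping"` come from the traceback above, `"is_ordered"` from the protocol spec as I understand it).

```python
from typing import Any, List

# Assumed key set; adjust if the protocol version in use differs.
EXPECTED_KEYS = ("is_ordered", "is_dictionary", "mapping")


def describe_categorical_problems(value: Any) -> List[str]:
    """Return a list of human-readable problems; empty means it looks fine."""
    if not isinstance(value, dict):
        return [f"expected dict, got {type(value).__name__}"]
    return [f"missing key: {key!r}" for key in EXPECTED_KEYS if key not in value]


# The tuple from this issue is flagged; a dict with all keys passes.
print(describe_categorical_problems((False, True, {0: 0, 1: 1})))
print(describe_categorical_problems(
    {"is_ordered": False, "is_dictionary": True, "mapping": {0: 0, 1: 1}}
))
```

In practice the value passed in would be `interchange_col.describe_categorical` from whichever library is under test.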

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.