rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Adding support for categorical column indexes #8743

Open charlesbluca opened 3 years ago

charlesbluca commented 3 years ago

Is your feature request related to a problem? Please describe. Categorical column indexes exist in a weird place of quasi-support in cuDF: it is possible to set a dataframe's column index to a pd.CategoricalIndex without any error or warning, but df.columns does not give back a categorical index afterwards, which contrasts with the behavior of Pandas:

import cudf
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
pdf.columns = pdf.columns.astype("category")

gdf = cudf.from_pandas(pdf)

print(pdf.columns)  # CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')
print(gdf.columns)  # Index(['a', 'b'], dtype='object')

This means that while there are user-facing issues that result from using cuDF's "categorical" column indexes (such as #7365), our ability to test for them is limited, because the standard comparison against Pandas dataframes fails:

from cudf.testing._utils import assert_eq

assert_eq(pdf, gdf)  # AssertionError: DataFrame.columns are different

Describe the solution you'd like After chatting with @shwina, it seems that the ideal solution would be to use the individual categorical scalars, instead of their string names, as data when constructing the ColumnAccessor in the columns setter method. However, this isn't possible, as neither Pandas nor cuDF offers categorical scalars.
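To illustrate the point about missing categorical scalars, here is a minimal Pandas-only sketch (no cuDF needed). It shows that pulling labels out of a CategoricalIndex yields plain Python strings, so an index rebuilt from those labels is no longer categorical. The variable names are illustrative only.

```python
import pandas as pd

cols = pd.CategoricalIndex(["a", "b"])

# Iterating over a CategoricalIndex yields the underlying values (plain str
# here), not some "categorical scalar" that remembers its dtype.
labels = list(cols)
print([type(label).__name__ for label in labels])

# Rebuilding an index from those labels therefore loses the categorical type.
rebuilt = pd.Index(labels)
print(type(rebuilt).__name__, rebuilt.dtype)
```

This is presumably why labels stored as plain keys cannot, on their own, tell df.columns that the original index was categorical.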

An alternative would be to have a boolean attribute, on either the dataframe or the ColumnAccessor, indicating whether the column index is categorical; this could then be used by ColumnAccessor.to_pandas_index() to properly reconstruct the index with categories if needed. This would come with its own consequences, specifically either
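A rough sketch of the boolean-attribute idea, written against Pandas only so it runs without a GPU. SketchColumnAccessor, label_is_categorical, and from_pandas_columns are hypothetical names made up for illustration; this is not cuDF's actual ColumnAccessor, only a stand-in with a to_pandas_index() method of the same name.

```python
import pandas as pd


class SketchColumnAccessor:
    """Hypothetical stand-in for cuDF's ColumnAccessor; not the real class."""

    def __init__(self, data, label_is_categorical=False):
        self._data = dict(data)  # column label -> column values
        self.label_is_categorical = label_is_categorical  # hypothetical flag

    @classmethod
    def from_pandas_columns(cls, columns, values):
        # Record whether the incoming column index was categorical before
        # its labels are flattened into plain dict keys.
        is_cat = isinstance(columns, pd.CategoricalIndex)
        return cls(zip(columns, values), label_is_categorical=is_cat)

    def to_pandas_index(self):
        index = pd.Index(list(self._data))
        if self.label_is_categorical:
            # Categories are re-inferred from the labels; a bare boolean
            # cannot preserve unused categories or the `ordered` setting.
            index = index.astype("category")
        return index


columns = pd.Index(["a", "b"]).astype("category")
accessor = SketchColumnAccessor.from_pandas_columns(
    columns, [[1, 2, 3], [4, 5, 6]]
)
roundtripped = accessor.to_pandas_index()
print(type(roundtripped).__name__)
```

One visible limitation of a bare boolean, noted in the comments above, is that it records only that the index was categorical, not its full set of categories or whether it was ordered.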

Describe alternatives you've considered A possible alternative that @shwina and I explored, but were unable to get working, is to pass specific kwargs to assert_eq so that it checks only the column index names and not the index type. Despite trying different combinations of check_categorical=False, check_column_type=False, etc., we were unable to get a passing test when comparing these indexes.
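For reference, the kind of check being aimed at (labels and names equal, index type ignored) can be sketched with a small standalone helper. assert_column_labels_equal is a hypothetical name, not part of cudf.testing, and this side-steps assert_eq rather than configuring it. The example uses two Pandas dataframes, with the second standing in for the result of a cuDF round trip.

```python
import pandas as pd


def assert_column_labels_equal(left, right):
    """Hypothetical helper: compare column labels and names, ignoring index type."""
    assert list(left.columns) == list(right.columns)
    assert list(left.columns.names) == list(right.columns.names)


# Dataframe with a categorical column index, as in the example above.
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
pdf.columns = pdf.columns.astype("category")

# Stand-in for a dataframe whose column index came back as a plain Index.
plain = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

assert_column_labels_equal(pdf, plain)  # passes despite differing index types
```

Such a helper would only hide the type mismatch in tests; it does nothing to make df.columns return a categorical index.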

Additional context This issue came up while working on #8560, where added test cases would require this feature and needed to be xfailed.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.