rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Use libcudf Dictionary type for CategoricalColumn in Python #8573

Open beckernick opened 3 years ago

beckernick commented 3 years ago

cuDF Python would like to back the CategoricalColumn with the Dictionary type. Work has been initiated toward this goal in https://github.com/rapidsai/cudf/pull/8567

wence- commented 1 year ago

This desire came up again recently in relation to #14138, where it is noted that we implement a lot of "heavyweight" algorithms as a sequence of calls in Python, rather than pushing down into libcudf.

@isVoid's implementation work in #8567 stalled due to some differences in the way libcudf and pandas (and hence cudf) choose to model dictionary-encoded columns.

In libcudf, the keys of the dictionary are required to be sorted, and the encoding looks up the value by indexing into the keys array. This restricts dictionary encoding to keys that admit a total order, and (I think) doesn't have a hook for a user-provided comparator.
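As a rough illustration of the sorted-keys model described above, here is a minimal pure-Python sketch. It is not libcudf's API: the function names `dictionary_encode` and `dictionary_decode` are hypothetical, and the binary-search lookup is just one way to exploit sorted keys.

```python
from bisect import bisect_left


def dictionary_encode(values):
    """Return (sorted unique keys, indices into those keys) for `values`."""
    # Sorting requires the key type to admit a total order.
    keys = sorted(set(values))
    # Because the keys are sorted, each value can be located by binary search.
    indices = [bisect_left(keys, v) for v in values]
    return keys, indices


def dictionary_decode(keys, indices):
    """Recover the original values by indexing into the keys."""
    return [keys[i] for i in indices]


values = ["b", "a", "c", "a"]
keys, indices = dictionary_encode(values)
decoded = dictionary_decode(keys, indices)
```

Note that the order of `keys` is fixed by the key type's comparison here, so there is nowhere to record a different, user-chosen order.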

In pandas, categoricals (dictionary-encoded columns) come in two flavours:

  1. ordered
  2. unordered

The latter do not require that the keys admit a total order (or indeed a partial one), and can be applied even in the case where the key type does have a "natural" ordering, e.g.:

In [5]: col = pd.Categorical([1, 2, 3], ordered=False)

In [6]: col.min() # => TypeError

Ordered categoricals either use the natural ordering induced by the key type (this matches libcudf), or allow for a user-defined ordering. This enables the user to impose a total order on naturally unordered key types (for example floats), and/or provide one that is different from the natural order:

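import pandas as pd

# Natural ordering of the keys: 1 < 2 < 3
col = pd.Categorical([3, 2, 1], ordered=True)
col.min() # => 1

# User-defined ordering taken from `categories`: 3 < 1 < 2
col = pd.Categorical([3, 2, 1], categories=[3, 1, 2], ordered=True)
col.min() # => 3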
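To make the storage difference concrete, the following sketch inspects the `categories` and `codes` attributes of the two ordered categoricals above. It only uses public pandas attributes; the variable names are illustrative.

```python
import pandas as pd

# Categories inferred from the data are stored in sorted order.
natural = pd.Categorical([3, 2, 1], ordered=True)
natural_keys = list(natural.categories)
natural_codes = list(natural.codes)

# Explicit categories are stored in the order the user gave them,
# and the codes index into that user-defined order.
user = pd.Categorical([3, 2, 1], categories=[3, 1, 2], ordered=True)
user_keys = list(user.categories)
user_codes = list(user.codes)
```

In the second case the keys are kept as `[3, 1, 2]`, which is not sorted, so this layout cannot be handed directly to an encoding that requires sorted keys.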

AIUI, bridging these differences is what required too many hacks and workarounds on the Python side.

In light of this, we should consider whether libcudf needs some extensions to support cudf's use of dictionary encoding, or whether there is a smart way to manage things in a translation layer that doesn't require huge amounts of special-casing.
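One conceivable shape for such a translation layer, sketched in pure Python purely for illustration: store the keys sorted (as the sorted-keys model requires) and carry a separate rank array recording where each key sits in the user-defined order. Everything here is hypothetical (the names `encode_with_user_order` and `ordered_min`, and the rank-array design itself); it is not something either library provides, and it assumes every value appears in `categories`.

```python
from bisect import bisect_left


def encode_with_user_order(values, categories):
    """Hypothetical translation: sorted keys plus a rank array for the user order."""
    keys = sorted(categories)  # storage order: sorted keys
    codes = [bisect_left(keys, v) for v in values]  # indices into the sorted keys
    # rank[i] is the position of keys[i] in the user-defined category order.
    position = {c: r for r, c in enumerate(categories)}
    rank = [position[k] for k in keys]
    return keys, codes, rank


def ordered_min(keys, codes, rank):
    """Minimum under the user-defined order, not the keys' natural order."""
    return keys[min(codes, key=lambda c: rank[c])]


keys, codes, rank = encode_with_user_order([3, 2, 1], categories=[3, 1, 2])
result = ordered_min(keys, codes, rank)
```

This reproduces the `col.min() # => 3` behaviour from the pandas example, but it still relies on the keys being sortable in the first place, so it does not address unordered categoricals over key types with no total order.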

vyasr commented 1 year ago

Another reason Michael's work stalled is that, because categorical data in cudf does not map directly to a libcudf type, it is special-cased all over the place, so any change requires a large amount of work to track. We were hoping that it would be simpler to work on this after we had refactored cudf internals to a point where the categorical logic was better isolated to just the categorical column, or at least more contained in some other way. I'm not opposed to revisiting the work now, but just an FYI that I'd hope this becomes substantially easier after we restructure cudf internals around pylibcudf over the next couple of releases.