Open eriknw opened 2 years ago
This works now, with the caveat that the categorical column must be explicitly marked as ordered:
In [1]: import cudf
...: import dask_cudf
...: df = cudf.DataFrame({"a": list("caba"), "b": list(range(4))})
...: df["a"] = df["a"].astype("category").cat.as_ordered() # Without ordering the dask version fails.
...: ddf = dask_cudf.from_cudf(df, npartitions=2)
...: print(df.sort_values("a"))
...: print(ddf.sort_values("a").compute())
a b
1 a 1
3 a 3
2 b 2
0 c 0
a b
1 a 1
3 a 3
2 b 2
0 c 0
@rjzamora any idea why dask_cudf behaves differently from cudf w.r.t. the ordering?
Good catch @vyasr - The dask behavior was actually "fixed" recently in dask-expr (https://github.com/dask/dask-expr/pull/1058), but I just realized that the pd.CategoricalDtype check will need to be updated to work for cudf (my mistake for missing that when I reviewed).
Even with dask-expr fixed, however, your snippet will not work for dask_cudf, because there seems to be a bug in cudf:
import cudf as lib # Works for pandas, but not for cudf
df = lib.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")
df = df.iloc[:2]
df["a"].cat.as_ordered()
...
ValueError: Length of values (4) does not match length of index (2)
EDIT: I submitted https://github.com/rapidsai/cudf/issues/15778 to track this.
Update: The latest version of dask-expr:main + dask:main now results in an ugly segfault when sorting on a categorical column. After https://github.com/rapidsai/cudf/pull/15788, the user will get a clear error until the upstream divisions logic is "generalized" to work with cudf.
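For readers unfamiliar with the divisions logic mentioned above, here is a simplified pure-Python sketch of why a distributed sort needs an ordered categorical. It is an illustration only, not the dask-expr or cudf implementation: the function name partitioned_sort and its parameters are hypothetical, and it assumes every value appears in the categories list.

```python
from bisect import bisect_right


def partitioned_sort(values, categories, npartitions=2):
    """Sort `values` by category order using range partitioning on codes.

    Hypothetical helper for illustration; assumes every value is listed
    in `categories`, whose order defines the sort order.
    """
    if not values:
        return []

    # An ordered categorical gives every category a rank (its code).
    code_of = {cat: i for i, cat in enumerate(categories)}
    codes = sorted(code_of[v] for v in values)

    # Pick npartitions - 1 boundaries at evenly spaced quantiles of the codes.
    boundaries = [
        codes[(len(codes) * k) // npartitions] for k in range(1, npartitions)
    ]

    # Route each row to the partition whose code range contains it.
    buckets = [[] for _ in range(npartitions)]
    for v in values:
        buckets[bisect_right(boundaries, code_of[v])].append(v)

    # Sort each partition locally; concatenating them yields a global sort.
    return [
        v
        for bucket in buckets
        for v in sorted(bucket, key=code_of.__getitem__)
    ]


result = partitioned_sort(list("caba"), categories=["a", "b", "c"])
print(result)
```

Both the boundary selection and the routing step compare values against each other, so they are only well defined when the categories carry an order. That is consistent with the snippet above needing as_ordered() before the dask version works.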
The chain has gotten a bit long here, so let me summarize to make sure I have everything right. #15780 will fix #15778. Once that is merged, will this issue also be fixed in the dask-expr case, or is there still work to be done to generalize dask-expr to work correctly for cudf because https://github.com/dask/dask-expr/pull/1058 wasn't complete? And in either case, do we still expect this to fail for users of the legacy dask API (which I guess isn't too important if we're going to be forced to migrate to dask-expr anyway)?
Summary:
- Sorting by a categorical column works fine with dask-cudf, but only for the "legacy" API
- Until recently, categorical sorting has been broken for both pandas and cudf-backed data in dask-expr
- In order to support cudf-backed data, we essentially need the RepartitionQuantiles logic to work for both cudf and pandas data
- [...] RepartitionQuantiles, but it didn't fix everything

I certainly want to fix categorical sorting for 24.06 if possible, but my current expectation is that we will need to raise an error and tell the user to disable query planning. If I can find a work-around in the next day or so, then we can remove the error. Otherwise, the proper/upstream fix will only apply to 24.08.
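For anyone who hits the error in the meantime, the "disable query planning" fallback could look roughly like the sketch below. It assumes Dask's usual mapping of the dataframe.query-planning config key to the DASK_DATAFRAME__QUERY_PLANNING environment variable, and that the variable is set before dask.dataframe or dask_cudf is first imported. The helper name disable_query_planning is hypothetical.

```python
import os


def disable_query_planning() -> None:
    """Request the legacy (non dask-expr) DataFrame API.

    Hypothetical helper: it only sets the environment variable, so it
    must run before dask.dataframe or dask_cudf is imported anywhere
    in the process.
    """
    os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"


disable_query_planning()

# Import dask_cudf only after the variable is set, for example:
# import dask_cudf
```

Setting the variable in the shell before launching Python should have the same effect, and avoids depending on import order inside the program.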
Describe the bug
ddf.sort_values(col) does not work with a dask_cudf DataFrame when col is categorical.
Steps/Code to reproduce bug
Traceback
```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [7], line 1
----> 1 ddf.sort_values("a")

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.
```
Expected behavior
I expect it to work--that is, match the result of cudf and dask.dataframe.
Environment overview (please complete the following information)
Environment details
Additional context
Encountered in PropertyGraph in cugraph.