Closed charlesbluca closed 2 years ago
Oof, good find. Global order isn't preserved here.
I suspect this might be the problem in the testing utility: https://github.com/dask/dask/blob/9fc5777f3d83f1084360adf982da301ed4afe13b/dask/dataframe/utils.py#L553-L554
EDIT:
a = _maybe_sort(a)
b = _maybe_sort(b)
tm.assert_frame_equal(a, b, check_dtype=check_dtype, **kwargs)
Once we're in pandas-land, sorting will return the global index order for both dataframes.
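To illustrate why the sort hides the ordering difference, here is a minimal pandas-only sketch. The helper `sort_for_comparison` is a hypothetical stand-in that assumes `_maybe_sort` orders rows by value and then by index; it is not the actual dask implementation, and the sample data is made up.

```python
import pandas as pd
import pandas.testing as tm


def sort_for_comparison(df):
    # Hypothetical stand-in for _maybe_sort (assumption): sort rows by
    # value, then by index, so any original row order is discarded.
    return df.sort_values(by=list(df.columns)).sort_index()


def frames_equal(x, y):
    # Return True if assert_frame_equal passes, False if it raises.
    try:
        tm.assert_frame_equal(x, y)
        return True
    except AssertionError:
        return False


expect = pd.DataFrame({"a": [1, 1, 2], "b": [3.0, 4.0, 5.0]})
got = expect.iloc[[2, 0, 1]]  # same rows, different row order

# Compared as-is, the differing row order is detected.
raw_equal = frames_equal(got, expect)

# After sorting both sides, the ordering difference is no longer visible.
sorted_equal = frames_equal(sort_for_comparison(got), sort_for_comparison(expect))
```

Under these assumptions, any check that sorts both inputs first cannot tell a correctly sorted result from an incorrectly sorted one.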
Thanks for the quick find @beckernick! Looks like we can get around this behavior by setting `check_index=False`, but I'd imagine for cases where we want to compare the sorting and the index, it would be nice to have a kwarg like `check_order` that can be used to enable/disable the calls to `_maybe_sort` altogether.
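A rough sketch of what such a flag could look like, written against plain pandas. Both the function `assert_frames_eq` and its `check_order` parameter are hypothetical names used for illustration only; this is not dask's `assert_eq`, and the sorting step is an assumption about what `_maybe_sort` does.

```python
import pandas as pd
import pandas.testing as tm


def assert_frames_eq(a, b, check_order=False, **kwargs):
    # Hypothetical helper for illustration; this is not dask's assert_eq.
    if not check_order:
        # Normalize row order on both sides so only the contents are compared.
        a = a.sort_values(by=list(a.columns)).sort_index()
        b = b.sort_values(by=list(b.columns)).sort_index()
    tm.assert_frame_equal(a, b, **kwargs)


expect = pd.DataFrame({"a": [2, 1, 1], "b": [5.0, 3.0, 4.0]})
got = expect.sort_values(by=["a", "b"])  # same rows, different row order

# Default behavior: ordering is ignored, so this passes.
assert_frames_eq(got, expect)

# With check_order=True the sort is skipped and the mismatch surfaces.
try:
    assert_frames_eq(got, expect, check_order=True)
    order_mismatch_detected = False
except AssertionError:
    order_mismatch_detected = True
```

Defaulting the flag to the order-insensitive behavior would keep existing tests unchanged, while sorting tests could opt in to the strict comparison.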
For now, I think the best short-term option would be to replace any instances of
dd.assert_eq(got, expect)
with something like
from cudf.testing._utils import assert_eq
assert_eq(got.compute(), expect)
Describe the bug When doing multi-column sorting with a dask-cudf dataframe containing nulls, the ordering is incorrect:
Expected behavior The proper ordering:
Environment overview (please complete the following information)
Environment details
Additional context This bug should've been caught in this test. However, for some reason `dask.dataframe.assert_eq` doesn't raise an error for these differently sorted dataframes: