rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.48k stars 907 forks source link

[ENH] Audit `argsort + gather/scatter` patterns for missing performance #13557

Open wence- opened 1 year ago

wence- commented 1 year ago

During review of #13419 we noted a few places where there is a pattern like:

some_frame = ...
indices = some_frame.some_column.argsort()
new_frame = some_frame.take(indices)

As well as the unnecessary bounds-check (see #13456), this is a pattern that is captured by libcudf's sort_by_key and stable_sort_by_key functions (we would want to use the latter in pandas-compat mode).

At present, libcudf implements this as a argsort of the key columns followed by a gather. But that's an implementation detail (there may in the future be updates to that implementation). In the Python layer we should "say what we mean" and call into the appropriate libcudf API.

A cursory search shows:

/usr/bin/rg -U -e argsort.\*'     
'.\*take

core/_base_index.py
1305:        indices = self.argsort(ascending=ascending, na_position=na_position)
1306:        index_sorted = self.take(indices)

core/indexed_frame.py
2464:                # double-argsort to map back from sorted to unsorted positions
2465:                df = df.take(index.argsort(ascending=True).argsort())

core/column/column.py
1382:        order = order.take(left_gather_map, check_bounds=False).argsort()
1383:        codes = codes.take(order)

core/groupby/groupby.py
686:            gather_map = ordering.take(to_take).argsort()
687:            return result.take(gather_map)

Of these, the calls in _base_index.py, _column.py, and groupby.py can definitely be replaced by sort_by_key. Note also that none of these calls pass check_bounds=False to take so incur an unnecessary kernel launch to check in-boundsness for something that is guaranteed in bounds.

The take(argsort().argsort()) pattern is not a sort_by_key, however, we can elide one of the argsorts by noticing that take is a gather operation and for a permutation, the dual to gather is scatter. So this should be implemented as df.scatter(index.argsort()) instead...

These are just the cases where an argsort is immediately followed by a take, probably more diligent searching would find more.

### Tasks
- [ ] `RollingGroupby.__init__`
- [ ] `Groupby._head_tail`
- [ ] `IndexedFrame.interpolate`
- [ ] `IndexedFrame.sort_index`
- [ ] `IndexedFrame._reindex`
- [ ] `IndexedFrame.sort_values`
- [ ] `IndexedFrame._n_largest_or_smallest`
- [ ] `BaseIndex.sort_values`
- [ ] `MultiIndex.__repr__`
bdice commented 1 year ago

We also have a related problem in the C++ layer, where we sort indices and then gather in places like segmented_sort_by_key_common. At the libcudf level, we have to consider nulls and nontrivial types (where a gather is needed for handling those) but the same general idea applies: sort directly wherever possible. https://github.com/rapidsai/cudf/pull/13669#discussion_r1268276705