rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.32k stars 887 forks source link

[FEA] Implement argument-order stable option for `cudf::merge::merge` #16010

Open wence- opened 3 months ago

wence- commented 3 months ago

Is your feature request related to a problem? Please describe.

In implementing the cudf-polars interpreter, I would like to use cudf::merge::merge to implement the polars merge_sorted functionality. Where polars guarantees that for equal keys in the left and right dataframes, merge_sorted is stable (that is, equal keys, and hence carrier columns, from the left dataframe appear in the result before those in the right frame), libcudf makes no such guarantee. Note that this is only a binary merge.

cudf::merge::merge is n-ary, rather than binary, and uses a priority queue, based on the length of the input (and intermediate) dataframes that are produced. Hence, we can't guarantee that the result is stable wrt the original input (in the sense described above) without some gymnastics.

Describe the solution you'd like

I'd like a stable version of merge for the binary case. In this case there's no real advantage to be gained by picking the smaller dataframe as the left one (because there's only one round of merges), and so the implementation is cheap.

For stable n-ary merge, I think there's no way around adding an additional disambiguating key column to each table that becomes part of the comparator.

Describe alternatives you've considered

Always adding the disambiguating key column.

bdice commented 3 months ago

I am thinking about filing a PR for this. Let me know if you have a timeline/prioritization in mind.

wence- commented 3 months ago

I am thinking about filing a PR for this. Let me know if you have a timeline/prioritization in mind.

Def not P0, I can implement on the interpreter side whenever. It's not blocking any milestones atm