rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support left-semi and left-anti joins in `cudf::hash_join` #13700

Open gaohao95 opened 1 year ago

gaohao95 commented 1 year ago

Is your feature request related to a problem? Please describe.

cudf::hash_join makes it possible to build the hash table once and probe it multiple times, but it only supports inner, left, and full joins. I would like cudf::hash_join to support left-semi and left-anti joins as well.
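To illustrate the build-once, probe-many pattern described here, below is a minimal CPU-side sketch in plain Python. It is not the cudf API: the class `HashJoinSketch` and its methods are hypothetical names used only for illustration, keys are plain Python values, and null handling is ignored.

```python
from collections import defaultdict


class HashJoinSketch:
    """Hypothetical stand-in for a reusable hash join object (not the cudf API)."""

    def __init__(self, build_keys):
        # Build phase, done once: map each key to every build-side row holding it.
        # Duplicate keys are kept, so this behaves like a multimap.
        self._table = defaultdict(list)
        for build_row, key in enumerate(build_keys):
            self._table[key].append(build_row)

    def inner_join(self, probe_keys):
        # Probe phase: one (probe_row, build_row) pair per matching build row.
        return [
            (probe_row, build_row)
            for probe_row, key in enumerate(probe_keys)
            for build_row in self._table.get(key, ())
        ]

    def left_join(self, probe_keys):
        # Like inner_join, but unmatched probe rows are kept with a None partner.
        pairs = []
        for probe_row, key in enumerate(probe_keys):
            matches = self._table.get(key)
            if matches:
                pairs.extend((probe_row, build_row) for build_row in matches)
            else:
                pairs.append((probe_row, None))
        return pairs


# The table is built once and then probed by two different join types.
join = HashJoinSketch([10, 20, 20, 30])
inner_pairs = join.inner_join([20, 40])
left_pairs = join.left_join([20, 40])
```

The request in this issue amounts to adding left-semi and left-anti probe methods alongside the ones sketched above, so that the same built object can serve those join types too.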

GregoryKimball commented 1 year ago

Thank you @gaohao95 for suggesting this. We will do some scoping and return to this request.

vyasr commented 1 year ago

There are some important limitations to be aware of.

gaohao95 commented 1 year ago

Thanks @vyasr! Those are good points!

> Therefore, the reuse would be limited between two disjoint sets of APIs: semi_join could reuse a map built for anti_join and vice versa, but it could not use a multimap built for inner/left/full joins.

In my use case (broadcast join) this is fine. Each hash join object only needs to be probed with a single join type.
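The distinction above between a map shared by semi and anti joins and a multimap needed for inner/left/full joins can be sketched in plain Python. This is only an illustration under simplifying assumptions: `SemiAntiJoinSketch` and its method names are hypothetical, not cudf or cuco APIs, and nulls are not handled.

```python
class SemiAntiJoinSketch:
    """Hypothetical reusable object for left-semi and left-anti joins (not the cudf API)."""

    def __init__(self, build_keys):
        # Build phase, done once: only key membership matters, so duplicates
        # collapse and no build-side row indices are stored.
        self._keys = set(build_keys)

    def left_semi_join(self, probe_keys):
        # Indices of probe rows whose key appears on the build side.
        return [row for row, key in enumerate(probe_keys) if key in self._keys]

    def left_anti_join(self, probe_keys):
        # Indices of probe rows whose key does not appear on the build side.
        return [row for row, key in enumerate(probe_keys) if key not in self._keys]


# One built object serves both semi and anti probes.
semi_anti = SemiAntiJoinSketch([10, 20, 20, 30])
semi_rows = semi_anti.left_semi_join([20, 40, 10])
anti_rows = semi_anti.left_anti_join([20, 40, 10])
```

Because this structure stores no build-side row indices, it cannot produce the (probe row, build row) pairs an inner, left, or full join needs, which is why the two groups of join types would not share one built object.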

> There is ongoing work to refactor cuco data structures and expand their usage within libcudf. I would not recommend making any changes to the join APIs until that work is further along.

This is not a blocker for us so we can wait.

ahmet-uyar commented 2 months ago

We also needed this recently for a broadcast join implementation. We would prefer that cudf::hash_join support left-semi joins.