rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.32k stars 888 forks source link

[FEA] Refactor various conditional join implementations for simplicity and API consistency #10039

Open vyasr opened 2 years ago

vyasr commented 2 years ago

Is your feature request related to a problem? Please describe.

9917 and #10037 added new join functions that use a mixture of hash lookups and AST expression evaluation. In the interest of time, a number of important performance improvements, API design questions, and general internal refactorings were overlooked. This issue aims to track those potential improvements for future work.

Describe the solution you'd like

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

ZelboK commented 10 months ago

Hi, I'd like to try helping out with some of this work. I've a few questions to start off with though. For example, for the first unchecked task "Implement OOP APIs for mixed joins" is there an associated issue with this? A simple search through the available issues does not show anything.

Is there something wrong with the current API that makes it hard for downstream users to use? Is there a general direction for refactoring the API?

Also another example, I'm still new to cudf(I also admit to having never used it before either). If I am understanding correctly, if you remove the size API from the joins this could be a breaking change correct?

edit: Ok, I will take another glance when I have time and report back here if i run into any problems. I think the size API one should be reasonable for me to approach.

vyasr commented 10 months ago

Hi @ZelboK sorry for the slow response, let me try and answer your questions now:

For example, for the first unchecked task "Implement OOP APIs for mixed joins" is there an associated issue with this? A simple search through the available issues does not show anything.

There is not a separate issue for this, no. What this is referring to is that we have a hash_join class that may be used to perform multiple joins where one of the tables is always the same. By building the hash table once and then probing it repeatedly, we can amortize the cost of building the hash table. That object supports hash joins only right now, but mixed joins would also benefit from the same performance gains since they are part hash join.

Is there something wrong with the current API that makes it hard for downstream users to use? Is there a general direction for refactoring the API?

More generally here I think we'd like to migrate towards API names that are less confusing. It's not obvious what a "mixed" or "conditional" join is. Ideally we'd have the function signature tell us everything we need about the nature of the join (equality vs arbitrary AST-based condition).

If I am understanding correctly, if you remove the size API from the joins this could be a breaking change correct?

Yes, this would be a breaking change. In general we are OK with making breaking changes, though, so that's not a problem. We just need to communicate that effectively. In this case I don't expect the breaking change to affect many users because the current approach doesn't add any meaningful functionality.

ZelboK commented 10 months ago

All good. I haven't had time anyway. I will have some time this week though.

With that being said, are the devs active on any specific slack channel that isn't #general? There's #oss-development but not many people are in it. I personally would like to try these tasks out independently without guidance(so I can learn more), but I occasionally have some questions. Are these issues the best way to communicate?

bdice commented 10 months ago

Are these issues the best way to communicate?

Issues (or comments on pull requests) are good for communication about specific topics in the code or coordinating work. Slack is fine for general questions.

ZelboK commented 10 months ago

So for the removing size APIs, are you referring to these functions?

https://docs.rapids.ai/api/libcudf/nightly/group__column__join.html#ga00b702723fd8953d5de802bc37965525 https://docs.rapids.ai/api/libcudf/nightly/group__column__join.html#gaecfa4e8182521bb5630adf1bb0b609c2

I imagine this just means to remove them completely and their uses anywhere? Or is there more to this issue that I am missing? In the original paragraph I see the emphasis on wanting to transition over to a "single-kernel approach". IIRC mixed_join_semi calls several kernels so perhaps it is desired to refactor this to reduce the overhead of the additional kernels?

ZelboK commented 10 months ago

Hi,

I actually have some code set up. I intend on pushing a PR up hopefully this week should I have enough time. That way I can get feedback on whether or not my design is appropriate or not too. Is it possible to assign some of these issues to myself?

bdice commented 10 months ago

This is a meta-issue with several independent tasks. We can cross-link the PR in the checklist and assign the PR to you once it’s open. Thanks!