rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.45k stars 907 forks source link

[PERF] Improve "isin" performance by only sorting once #11548

Open GregoryKimball opened 2 years ago

GregoryKimball commented 2 years ago

Is your feature request related to a problem? Please describe. While benchmarking cuDF-python, I noticed that bench_isin has low end-to-end data throughput (<10GB/s). A closer look at the profiles showed that the data is being sorted twice, first with .sort_values() and then as part of drop_duplicates(). The following profile is for a test dataframe with 1 col and 100K rows, and is uses the isin argument range(1000) in the bench_isin benchmark.

image

When calling isin with a dataframe or dict argument, the profile shows two calls to Frame.argsort.

image

Describe the solution you'd like For a performance improvement, I'd like to refactor isin to only sort the data once. We should prefer the libcudf unique function to drop_duplicates for pre-sorted data.

Describe alternatives you've considered n/a

Additional context Add any other context, code examples, or references to existing implementations about the feature request here.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr commented 2 years ago

I see an extraneous sort_values call in Column._obtain_isin_result that can be removed, but I don't see any impact on performance locally when I make that change. Perhaps you can try that locally and see if it improves things. However, I suspect part of the problem is that we are explicitly sorting and then unsorting the data. I'd have to dig a bit deeper to understand exactly why on the previous line we are actually joining and requesting a sorted result only to unsort afterwards. I assume there's some pandas compatibility at play here, perhaps @galipremsagar or @shwina knows offhand.