GregoryKimball opened this issue 2 years ago
This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
I see an extraneous `sort_values` call in `Column._obtain_isin_result` that can be removed, but I don't see any impact on performance locally when I make that change. Perhaps you can try that locally and see if it improves things. However, I suspect part of the problem is that we are explicitly sorting and then unsorting the data. I'd have to dig a bit deeper to understand exactly why on the previous line we are actually joining and requesting a sorted result only to unsort afterwards. I assume there's some pandas compatibility at play here, perhaps @galipremsagar or @shwina knows offhand.
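For anyone who wants to poke at this, here's a rough CuPy sketch of the sort-then-unsort pattern described above. It is illustrative only: the function name is made up and `cupy.isin` stands in for the sorted join, so this is not the actual `Column._obtain_isin_result` code path.

```python
import cupy as cp

def isin_via_sort_unsort(column, values):
    """Illustrative only -- not the real Column._obtain_isin_result path."""
    order = cp.argsort(column)            # first sort (argsort #1)
    sorted_col = column[order]
    # Stand-in for the join that requests a sorted result.
    mask_sorted = cp.isin(sorted_col, values)
    # "Unsort": scatter the membership mask back to the original row order.
    mask = cp.empty_like(mask_sorted)
    mask[order] = mask_sorted
    return mask

# e.g. isin_via_sort_unsort(cp.asarray([5, 1, 3, 1]), cp.asarray([1, 5]))
# -> [ True  True False  True]
```

The final scatter is the "unsort" step; if the join result doesn't actually need to come back sorted, both the initial `argsort` and that scatter could in principle be dropped.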
**Is your feature request related to a problem? Please describe.**
While benchmarking cuDF-python, I noticed that `bench_isin` has low end-to-end data throughput (<10 GB/s). A closer look at the profiles showed that the data is being sorted twice, first with `.sort_values()` and then as part of `drop_duplicates()`. The following profile is for a test dataframe with 1 column and 100K rows, and uses the `isin` argument `range(1000)` from the `bench_isin` benchmark. When calling `isin` with a dataframe or dict argument, the profile shows two calls to `Frame.argsort`.
**Describe the solution you'd like**
For a performance improvement, I'd like to refactor `isin` to only sort the data once. We should prefer the libcudf `unique` function over `drop_duplicates` for pre-sorted data (a rough sketch of the idea is included under Additional context below).

**Describe alternatives you've considered**
n/a
**Additional context**
Add any other context, code examples, or references to existing implementations about the feature request here.
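As a rough illustration of why `unique` on pre-sorted data should be cheaper than a general `drop_duplicates`, here is a minimal CuPy sketch (the helper name is made up; libcudf's actual `unique` operates on the sorted column directly):

```python
import cupy as cp

def unique_sorted(sorted_values):
    """Deduplicate an already-sorted array with one adjacent comparison.

    Illustrative only: in sorted data duplicates are adjacent, so no extra
    sort or hash pass is needed, which is the rationale for preferring
    libcudf's unique over drop_duplicates here.
    """
    if len(sorted_values) == 0:
        return sorted_values
    keep = cp.concatenate(
        (cp.asarray([True]), sorted_values[1:] != sorted_values[:-1])
    )
    return sorted_values[keep]

# e.g. unique_sorted(cp.asarray([1, 1, 2, 5, 5])) -> [1 2 5]
```

Since `isin` already sorts the data for the join, routing the deduplication through a unique-on-sorted-data path would avoid the second sort shown in the profile.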