pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License
1.37k stars 170 forks source link

[ENH] improved performance for `first/last` in `conditional_join` #1382

Closed samukweku closed 4 months ago

samukweku commented 4 months ago

PR Description

Please describe the changes proposed in the pull request:

This PR improves conditional_join.

Performance tests - YMMV:

import pandas as pd
import janitor
In [21]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
    ...: df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})

In [32]: df1 = pd.concat([df1]*10_000)

In [33]: df2 = pd.concat([df2]*200)

# this PR
In [37]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
3.35 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [45]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B','<='),keep='first')
6.11 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# dev 
In [47]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B'
    ...: ,'<='),keep='first')
  def conditional_join(
300 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first')
57.6 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
ericmjl commented 4 months ago

🚀 Deployed on https://deploy-preview-1382--pyjanitor.netlify.app