Closed samukweku closed 4 months ago
Please describe the changes proposed in the pull request:
first/last
conditional_join
This PR improves conditional_join.
Performance tests - YMMV:
import pandas as pd import janitor In [21]: df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]}) ...: df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]}) In [32]: df1 = pd.concat([df1]*10_000) In [33]: df2 = pd.concat([df2]*200) # this PR In [37]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first') 3.35 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [45]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B','<='),keep='first') 6.11 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # dev In [47]: %timeit df1.conditional_join(df2.sort_values(['value_2A','value_2B']), ('value_1','value_2A','>='), ('value_1','value_2B' ...: ,'<='),keep='first') def conditional_join( 300 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [48]: %timeit df1.conditional_join(df2.sort_values('value_2A'), ('value_1','value_2A','>='), keep='first') 57.6 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
🚀 Deployed on https://deploy-preview-1382--pyjanitor.netlify.app
PR Description
Please describe the changes proposed in the pull request:
first/last
inconditional_join
This PR improves conditional_join.
Performance tests - YMMV: