pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Faster merge_asof() through a single-pass algo #13902

Closed · chrisaycock closed this issue 8 years ago

chrisaycock commented 8 years ago

Out of curiosity, I took a crack at a single-pass merge_asof(). My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use PyObjectHashTable while scanning through the right DataFrame to cache the most recently found row for each "by" object.

I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like ['ticker', 'exchange']. Still investigating.
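For concreteness, here is a rough pure-Python sketch of the single-pass idea described above. A plain dict stands in for PyObjectHashTable, the function name and column handling are made up for illustration, and the actual patch is written in Cython:

import numpy as np
import pandas as pd

def merge_asof_single_pass_sketch(left, right, on, by):
    # Both frames must already be sorted by the `on` column.
    left_on = left[on].values
    right_on = right[on].values
    left_by = left[by].values
    right_by = right[by].values

    latest = {}  # most recently seen right-row index for each `by` value
    matches = np.full(len(left), -1, dtype=np.int64)

    r = 0
    for i in range(len(left)):
        # Advance through the right frame while its `on` value is still
        # <= the current left value, caching the latest row per `by` key.
        while r < len(right) and right_on[r] <= left_on[i]:
            latest[right_by[r]] = r
            r += 1
        matches[i] = latest.get(left_by[i], -1)

    right_rows = right.drop([on, by], axis=1).reset_index(drop=True)
    picked = right_rows.reindex(matches).reset_index(drop=True)  # -1 -> all-NaN row
    return pd.concat([left.reset_index(drop=True), picked], axis=1)

The point of the sketch is that a single forward scan over both (sorted) frames suffices; there is no per-row search and no separate factorization pass.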

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit hash c4302949:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                               41.07ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.90ms
[ 50.00%] · For pandas commit hash 97de42ab:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running join_merge.merge_asof_by.time_merge_asof_by                                              608.08ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           81.03ms
   before     after       ratio
  [97de42ab] [c4302949]
-   81.03ms    12.90ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  608.08ms    41.07ms      0.07  join_merge.merge_asof_by.time_merge_asof_by

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
jreback commented 8 years ago

So you don't actually need Tempita here: you factorize things, and then only need to deal with int64s.

chrisaycock commented 8 years ago

@jreback The single-pass nature of this comes from not doing the factorization anymore. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation.
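(As a quick sanity check of the "timestamps are stored as integers" point, not taken from the thread and with a made-up variable name: datetime64[ns] data can be viewed directly as int64 nanoseconds.)

In [1]: t = pd.Series(pd.date_range('20160101', periods=3))

In [2]: t.values.view('i8')
Out[2]: array([1451606400000000000, 1451692800000000000, 1451779200000000000])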

I've issued a PR with the sample code to show how I did it. As I describe at the top of the message there, do not merge it in its current state...

jreback commented 8 years ago

@chrisaycock you can use the groupby factorization (it's quite cheap to do this):

In [5]: df = pd.DataFrame({'A' : pd.date_range('20130101',periods=3), 'B' : list('aab'), 'C' : range(3)})

In [6]: g = df.groupby(['A', 'B'])

In [7]: g.grouper.group_info
Out[7]: (array([0, 1, 2], dtype=int64), array([0, 2, 5], dtype=int64), 3)
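The same reduction to int64 codes can also be sketched with the public pd.factorize / Series.factorize API; the following continuation of the session is a hypothetical illustration of handling multiple "by" columns by zipping them into tuples:

In [8]: by_key = pd.Series(list(zip(df['A'], df['B'])))

In [9]: codes, uniques = by_key.factorize()

In [10]: codes
Out[10]: array([0, 1, 2])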
chrisaycock commented 8 years ago

Using the setup from the join_merge.merge_asof_by benchmark:

In [39]: %timeit pd.merge_asof(df1, df2, on='time', by='key')
10 loops, best of 3: 41.4 ms per loop

But the factorization alone takes way longer than that, and we haven't even gotten to the actual joining logic:

In [40]: %timeit df1.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 28.2 ms per loop

In [41]: %timeit df2.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 177 ms per loop

The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s aj[], then we need to pay attention to performance.)