Closed chrisaycock closed 8 years ago
So you don't actually need Tempita here. You factorize things, and so only need to deal with int64s.
@jreback The single-pass nature of this is that I'm not doing the factorizing anymore. I'm comparing the values in the "on" column directly, which is fine since timestamps are stored as integers anyway. But if I ever want to compare floats, then I assume I'll need proper type differentiation.
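To illustrate the point about timestamps being stored as integers, here is a small sketch. The sample timestamps are made up for the example; the cast to nanosecond resolution is there so that the integer view has a known unit.

```python
import numpy as np
import pandas as pd

# Made-up sample timestamps, 15 ms apart.
times = pd.Series(pd.to_datetime(["2016-05-25 13:30:00.023",
                                  "2016-05-25 13:30:00.038"]))

# Force nanosecond resolution, then reinterpret the same bytes as int64
# (nanoseconds since the epoch). No values are converted or copied by view().
as_int = times.to_numpy(dtype="datetime64[ns]").view("int64")

# Integer comparison gives the same ordering as timestamp comparison.
in_order = bool((np.diff(as_int) > 0).all())
gap_ns = int(as_int[1] - as_int[0])
```

Float "on" columns have no such integer view that can simply be compared, which is why they would need their own typed code path.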
I've issued a PR for the sample code to show how I did it. As I describe at the top of the message there, do not merge it in its current state...
@chrisaycock you can use the groupby factorization (it's quite cheap to do this):
In [5]: df = pd.DataFrame({'A' : pd.date_range('20130101',periods=3), 'B' : list('aab'), 'C' : range(3)})
In [6]: g = df.groupby(['A', 'B'])
In [7]: g.grouper.group_info
Out[7]: (array([0, 1, 2], dtype=int64), array([0, 2, 5], dtype=int64), 3)
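As far as I know, grouper.group_info is not part of the documented public API, so it may not be available in every pandas version. A rough sketch of the same idea using documented calls (pd.factorize for a single column, GroupBy.ngroup for a combination of columns) could look like this, reusing the frame from the session above:

```python
import pandas as pd

df = pd.DataFrame({'A': pd.date_range('20130101', periods=3),
                   'B': list('aab'),
                   'C': range(3)})

# Single column: each distinct value gets an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df['B'])

# Several columns: ngroup() labels each row with the number of its
# (A, B) group, with groups numbered in sorted key order.
group_ids = df.groupby(['A', 'B']).ngroup()
```

Either way, the join logic afterwards only has to compare integers, which is the point being made above.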
Using the setup from the join_merge.merge_asof_by benchmark:
In [39]: %timeit pd.merge_asof(df1, df2, on='time', by='key')
10 loops, best of 3: 41.4 ms per loop
But the factorization takes way longer than that, and we haven't even gotten to the actual joining logic:
In [40]: %timeit df1.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 28.2 ms per loop
In [41]: %timeit df2.groupby(['key', 'time']).grouper.group_info
10 loops, best of 3: 177 ms per loop
The fastest possible approach is a single-pass algorithm. (And if we want this function to be remotely competitive with q/kdb+'s aj[], then we need to pay attention to performance.)
Out of curiosity, I took a crack at a single-pass merge_asof(). My sample passes the existing regression tests but is "wrong" in that it works only for a single object-type "by" parameter. I use PyObjectHashTable while scanning through the right DataFrame to cache the most recently found row for each "by" object.
I could add a little type differentiation if there is interest. I see that Tempita is getting some use in pandas. The main question is whether I can use multiple columns in the "by" parameter, which would be useful for matching things like ['ticker', 'exchange']. Still investigating.