chrisaycock opened 7 years ago
so I think some spelling is in order here to make these easier to grok (we should apply this generally to merging operations), maybe something like this:
```python
merge_window(df1, df2, left_on='trade_time', right_on='exchange_time',
             time_range=(0*ms, 500*ms),
             how={'total_volume': (np.sum, 'volume'),
                  'mean_volume': (np.mean, 'volume')})
```

as

```python
(df1.merging(df2, left_on='trade_time', right_on='exchange_time')
    .window(time=(0*ms, 500*ms))
    .agg({'total_volume': (np.sum, 'volume'),
          'mean_volume': (np.mean, 'volume')})
)
```
so `.merging` is like a groupby/rolling/resample operator, and the next `.window()` gives the type of merge (e.g. we could also support `.asof()`, `.ordered()`)
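Neither `.merging` nor `merge_window` exists in pandas; as a sketch of the intended semantics only, the proposal could be emulated today with a cross join, a time-window filter, and a groupby. `merge_window_emulated` is a hypothetical helper (O(n*m), assumes a default RangeIndex on the left frame and pandas >= 1.2 for `how='cross'`):

```python
import numpy as np
import pandas as pd

ms = pd.Timedelta(1, 'ms')

def merge_window_emulated(left, right, left_on, right_on, time_range, how):
    """For each left row, aggregate right rows whose timestamp falls in
    [left_time + lo, left_time + hi].  Cross join is O(n*m): this is a
    semantics sketch, not an efficient implementation."""
    lo, hi = time_range
    m = left.reset_index().merge(right, how='cross')
    delta = m[right_on] - m[left_on]
    m = m[(delta >= lo) & (delta <= hi)]
    out = left.copy()
    for name, (func, col) in how.items():
        # group keys are the original left-frame row labels
        out[name] = m.groupby('index')[col].agg(func).reindex(left.index)
    return out

# tiny illustration with made-up trade/quote frames
df1 = pd.DataFrame({'trade_time': pd.to_datetime(['2016-01-01 09:30:00',
                                                  '2016-01-01 09:30:01'])})
df2 = pd.DataFrame({'exchange_time': pd.to_datetime(['2016-01-01 09:30:00.100',
                                                     '2016-01-01 09:30:00.600',
                                                     '2016-01-01 09:30:01.200']),
                    'volume': [10.0, 20.0, 30.0]})
res = merge_window_emulated(df1, df2, 'trade_time', 'exchange_time',
                            (0*ms, 500*ms),
                            {'total_volume': (np.sum, 'volume')})
```

Here each trade picks up the exchange volume arriving within 500 ms after it, so `res['total_volume']` is `[10.0, 30.0]`.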
👍 on Jeff's spelling. The original function seems to be doing a lot; better to split it up into components that are understandable.
The `.window` step could be generalized somewhat, e.g. https://github.com/dgrtwo/fuzzyjoin has some examples on numeric and word distance.
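For the numeric-distance case, pandas' `pd.merge_asof` already gets partway there via its `direction` and `tolerance` parameters; a small sketch with made-up frames:

```python
import pandas as pd

# Both frames must be sorted on the join key.
left = pd.DataFrame({'x': [1.0, 5.0, 10.0]})
right = pd.DataFrame({'x': [1.1, 4.5, 12.0], 'y': ['a', 'b', 'c']})

# Match each left row to the nearest right key within distance 1.0;
# rows with no key inside the tolerance get NaN.
out = pd.merge_asof(left, right, on='x', direction='nearest', tolerance=1.0)
```

`x=1.0` matches `1.1`, `x=5.0` matches `4.5`, and `x=10.0` finds nothing within 1.0, so `out['y']` is `['a', 'b', NaN]`.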
cc @wesm
A quick example, as much for my benefit:

```python
import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})
df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})
```
Suppose I want to compute this:

```python
s = pd.Timedelta(1, 's')
merge_window(df1, df2, on='date', time_range=(-6*s, 2*s), how=(np.sum, 'value'))
```
The result would be:

```python
df1['value'] = [df2[(df2.date >= start) & (df2.date <= stop)]['value'].sum()
                for start, stop in zip(df1.date - 6*s, df1.date + 2*s)]
```
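Since both `date` columns are sorted, each window maps to a contiguous slice of `df2`, so the same result can be computed without a Python loop by locating slice bounds with `searchsorted` and reading windowed sums off a cumulative sum. A sketch (not an official pandas API, just the example frames above plus numpy):

```python
import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})
df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})

s = pd.Timedelta(1, 's')

# Each window [t - 6s, t + 2s] is a contiguous slice of the sorted df2;
# find the slice bounds, then take differences of a prefix sum.
keys = df2['date'].values
lo = keys.searchsorted((df1['date'] - 6*s).values, side='left')
hi = keys.searchsorted((df1['date'] + 2*s).values, side='right')
cum = np.concatenate([[0.0], np.cumsum(df2['value'].values)])
df1['value'] = cum[hi] - cum[lo]
```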
I'm going to take a crack at this now. I won't have Jeff's notation to start, since I think that's a bigger conversation, but I do want to get this functionality in as soon as possible.
Having a composable API for joins in general would be nice. How do you imagine the asof version would look?
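For reference, the asof flavor did later land in pandas as `pd.merge_asof`; with the example frames from above it looks like:

```python
import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})
df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})

# For each df1 row, take the most recent df2 row at or before its
# timestamp, but only if it is within 2 seconds.
out = pd.merge_asof(df1, df2, on='date', tolerance=pd.Timedelta('2s'))
```

e.g. the first df1 row (09:30:00) picks up the df2 row at exactly 09:30:00, giving `value == 5.0`.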
As an aside, the syntax for describing aggregates leaves a lot to be desired (this is not specific to this proposal). I spent a bunch of time thinking about how to do this (you can see what I came up with in some of the examples in http://docs.ibis-project.org/generated-notebooks/2.html#aggregating-joined-table-with-metrics-involving-more-than-one-base-reference). We may be able to study some other DSL (e.g. Blaze) to think about deferred join syntax. Adding any kind of expression DSL becomes a deep rabbit hole (e.g. deterministically resolving output types). Maybe something that should wait for pandas 2.0
Hi, what's the status on this guy?
@zak-b2c2 pandas is a community-led project, so contributions move it forward.
You are welcome to submit a pull request.
I would like a time-based aggregating function that also acts as a join:
http://code.kx.com/wiki/Reference/wj
In pandas, this might look something like:
This would compute the total volume from df2 whose timestamps are within (-100, 100) ms of df1’s timestamps and whose tickers match.
I imagine I could specify different columns:
By the way, this is a more general form of the `.rolling()` functions @jreback wrote recently, since I can just use one DataFrame as both parameters:

My firm has custom business calendars for things like market hours as well as exchange holidays. The kdb version takes arrays of timestamps directly for the begin and end, which handles the general-purpose case of custom business calendars. So I imagine I could get the five-day average of volume with:
I can get started on this right away if my proposal makes sense.
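A sketch of what a kdb-style `wj` taking explicit begin/end timestamp arrays could look like (`window_join` is a hypothetical helper, not a pandas API; it assumes the right frame is sorted on the key):

```python
import numpy as np
import pandas as pd

def window_join(right, on, agg_col, begin, end, agg=np.sum):
    """For each window i = [begin[i], end[i]] (inclusive), aggregate
    right[agg_col] over the rows whose `on` timestamp falls inside it.
    `right` must be sorted on `on`."""
    keys = right[on].values
    lo = keys.searchsorted(np.asarray(begin), side='left')
    hi = keys.searchsorted(np.asarray(end), side='right')
    vals = right[agg_col].values
    return np.array([agg(vals[a:b]) for a, b in zip(lo, hi)])

# Arbitrary begin/end arrays, e.g. produced by a custom business calendar
trades = pd.DataFrame({'time': pd.to_datetime(['2016-01-04 10:00:00',
                                               '2016-01-04 10:00:02',
                                               '2016-01-04 10:00:04']),
                       'volume': [100.0, 200.0, 50.0]})
begin = pd.to_datetime(['2016-01-04 10:00:00', '2016-01-04 10:00:03'])
end = pd.to_datetime(['2016-01-04 10:00:02', '2016-01-04 10:00:04'])
totals = window_join(trades, 'time', 'volume', begin, end)
```

Because `begin` and `end` are just arrays, irregular windows (half days, holidays, five-business-day lookbacks) need no special casing; here `totals` is `[300.0, 50.0]`.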