pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

kdb-like window join for pandas #13959

Open chrisaycock opened 7 years ago

chrisaycock commented 7 years ago

I would like a time-based aggregating function that also acts as a join:

http://code.kx.com/wiki/Reference/wj

In pandas, this might look something like:

ms = pd.Timedelta(1, 'ms')
merge_window(df1, df2, on='time', by='ticker', time_range=(-100*ms, 100*ms), how=(np.sum, 'volume'))

This would compute the total volume from df2 whose timestamps are within (-100, 100) ms of df1’s timestamps and whose tickers match.
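To pin down the proposed semantics, here is a naive reference implementation — `merge_window` does not exist in pandas, so this is just a sketch of what the call above would compute (quadratic in the input sizes; a real implementation would exploit sorted timestamps):

```python
import numpy as np
import pandas as pd

def merge_window(left, right, on, by=None, time_range=(None, None), how=None):
    """Naive sketch of the proposed window join: for each row of `left`,
    aggregate the rows of `right` whose `on` timestamp falls within
    [t + lo, t + hi] and whose `by` key matches. Hypothetical API."""
    agg_func, agg_col = how
    lo, hi = time_range
    out = left.copy()
    results = []
    for _, row in left.iterrows():
        window = right[(right[on] >= row[on] + lo) & (right[on] <= row[on] + hi)]
        if by is not None:
            window = window[window[by] == row[by]]
        results.append(agg_func(window[agg_col]))
    out[agg_col] = results
    return out
```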

I imagine I could specify different columns:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms),
             how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

By the way, this is a more general form of the .rolling() functions @jreback wrote recently, since I can just pass the same DataFrame as both arguments:

merge_window(df, df, on='time', time_range=(-5*ms, 0*ms), how=[(np.median, 'price'), (np.sum, 'volume')])
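For the trailing-window self-join case, pandas' existing time-based .rolling() already gets close — each row aggregates over the preceding window of the same frame. (Endpoint inclusion differs slightly from the proposal: offset windows default to (t - 5s, t] rather than [t - 5s, t]; the `closed` argument controls this.) A small sketch:

```python
import pandas as pd

# Time-based rolling over a datetime index: each row sums the rows of the
# same frame whose timestamps fall in the trailing 5-second window.
df = pd.DataFrame({'time': pd.date_range('2020-01-01', periods=6, freq='2s'),
                   'volume': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
trailing = df.set_index('time')['volume'].rolling('5s').sum()
```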

My firm has custom business calendars for things like market hours as well as exchange holidays. The kdb version takes arrays of timestamps directly for the begin and end times, which handles the general-purpose case of custom business calendars. So I imagine I could get the five-day average of volume with:

# internal functions
cal = get_our_internal_business_calendar()
begin = adjust_our_time(df.time, -5, cal)
end = adjust_our_time(df.time, 0, cal)

# use values directly
merge_window(df1, df2, on='time', begin_times=begin, end_times=end, how=(np.mean, 'volume'))
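The begin/end-arrays form also admits an efficient vectorized implementation when the right side is sorted by time: np.searchsorted locates each window's bounds, and prefix sums make per-window sums and means O(n log m) overall. A sketch (the helper name is hypothetical):

```python
import numpy as np

def window_sum_mean(right_times, right_values, begin, end):
    """Per-row [begin[i], end[i]] window sums and means over a sorted
    right side, via searchsorted + prefix sums. Hypothetical helper."""
    lo = np.searchsorted(right_times, begin, side='left')   # first index >= begin
    hi = np.searchsorted(right_times, end, side='right')    # first index > end
    prefix = np.concatenate(([0.0], np.cumsum(right_values)))
    sums = prefix[hi] - prefix[lo]
    counts = hi - lo
    # Empty windows get NaN means; guard the division to avoid 0/0.
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return sums, means
```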

I can get started on this right away if my proposal makes sense.

jreback commented 7 years ago

so I think some spelling is in order here, to make these easier to grok (should apply these generally to merging operations), maybe something like this:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms), how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

as

(df1.merging(df2, left_on='trade_time', right_on='exchange_time')
    .window(time=(0*ms, 500*ms))
    .agg({'total_volume':(np.sum, 'volume'), 
          'mean_volume':(np.mean, 'volume')})
)

so .merging is like a group/rolling/resample operator, the next .window() is the type of merge (e.g. support .asof(), .ordered())
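One way to read this spelling is as a small deferred object: .merging() records the join keys, .window() records the window type, and .agg() finally executes. A toy sketch (class and method names purely illustrative, trailing time window and a dict of aggregates only):

```python
import numpy as np
import pandas as pd

class Merging:
    """Toy deferred .merging(...).window(...).agg(...) chain. Illustrative
    only; a real design would dispatch to asof/ordered window types."""
    def __init__(self, left, right, on):
        self.left, self.right, self.on = left, right, on
        self.lo = self.hi = None

    def window(self, time):
        self.lo, self.hi = time
        return self

    def agg(self, spec):
        out = self.left.copy()
        for name, (func, col) in spec.items():
            out[name] = [
                func(self.right.loc[self.right[self.on].between(t + self.lo,
                                                                t + self.hi), col])
                for t in self.left[self.on]
            ]
        return out
```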

TomAugspurger commented 7 years ago

👍 on Jeff's spelling. The original function seems to be doing a lot; better to split it into components that are understandable.

The .window step could be generalized somewhat. e.g. https://github.com/dgrtwo/fuzzyjoin has some examples on numeric and word distance.

chrisaycock commented 7 years ago

cc @wesm

chrisaycock commented 7 years ago

A quick example, as much for my benefit:

import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})

df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})

Suppose I want to compute this:

s = pd.Timedelta(1, 's')
merge_window(df1, df2, on='date', time_range=(-6*s, 2*s), how=(np.sum, 'value'))

The result would be

df1['value'] = [df2[(df2.date >= start) & (df2.date <= stop)]['value'].sum()
                for start, stop in zip(df1.date - 6*s, df1.date + 2*s)]
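That per-row comprehension can be cross-checked with a plain cross-join, filter, and groupby — fine for small frames like this, though quadratic in general, which is exactly what merge_window would avoid (the `how='cross'` merge assumes pandas >= 1.2):

```python
import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})
df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})
s = pd.Timedelta(1, 's')

# Reference result: sum df2.value over [d - 6s, d + 2s] for each df1 date
expected = [df2[(df2.date >= start) & (df2.date <= stop)]['value'].sum()
            for start, stop in zip(df1.date - 6*s, df1.date + 2*s)]

# Cross-join, keep in-window pairs, aggregate back to one row per df1 date
joined = pd.merge(df1, df2, how='cross')  # columns: date_x, name, date_y, value
in_win = joined[(joined.date_y >= joined.date_x - 6*s) &
                (joined.date_y <= joined.date_x + 2*s)]
result = in_win.groupby('date_x')['value'].sum().reindex(df1.date).fillna(0.0)
```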

I'm going to take a crack at this now. I won't have Jeff's notation to start since I think that's a bigger conversation. But I do want to get this functionality as soon as possible.

wesm commented 7 years ago

Having a composable API for joins in general would be nice. How do you imagine the asof version would look?

As an aside, the syntax for describing aggregates leaves a lot to be desired (this is not specific to this proposal). I spent a bunch of time thinking about how to do this (you can see what I came up with in some of the examples in http://docs.ibis-project.org/generated-notebooks/2.html#aggregating-joined-table-with-metrics-involving-more-than-one-base-reference). We may be able to study some other DSL (e.g. Blaze) to think about deferred join syntax. Adding any kind of expression DSL becomes a deep rabbit hole (e.g. deterministically resolving output types). Maybe this is something that should wait for pandas 2.0.

zak-b2c2 commented 3 years ago

Hi, what's the status of this issue?

jreback commented 3 years ago

@zak-b2c2 pandas is a community-led project, so contributions move it forward

you are welcome to submit a pull request