knizhnik / imcs

In-Memory Columnar Store extension for PostgreSQL
Apache License 2.0
203 stars 33 forks source link

Is there an efficient way to deal with timeseries alignment when using group-by functions? #62

Closed pavitrakumar78 closed 3 years ago

pavitrakumar78 commented 3 years ago

Hi,

I've been writing a lot of supportive code to deal with the alignment of arrays when performing operations with 2 timeseries. I understand that since IMCS does not have NULL values, this is something we have to handle by ourselves.

In most cases, I can calculate the shift beforehand by estimating the date range and other parameters, but this becomes unpredictable when there is a grouping function involved - functions like cs_group_max, cs_group_avg etc.,

Here's an example:

select close_price - cs_group_min(close_price, cs_week(symbol_date)) from us_eod_get('AAPL.US', '2020-11-04', '2020-11-25');

What I'm trying to do is get the difference between the close price and the weekly lowest close price for a date range. This obviously gives incorrect result because the arrays are subtracted like this:

int8:{1147520,1188250,1186900,1163200,1159700,1194900,1192100,1192600, ...} -> close price
int8:{1147520,1159700,1173400,1138500}              ( - )                                           -> group_min weekly
--------------------------------------------------------------------------------------------------------
int8:{0,28550,13500,24700}

I can fix this by adding a left shift (<<) of 12 to close_price. But calculating this shift becomes complex because in actual use:

With these issues, I'm not sure how it's possible to correctly dynamically shift/align the arrays so that the calculation is correct.

Would it be possible to create or is there a way with current functions, we can iterate (the imcs_iterator_h) from the end and stop at the minimum distance of the 2 timeseries? That way, shifting won't be necessary and calculations would be correct or is there any other way to achieve the same effect but in a more efficient way other than to execute the queries individually, calculate the shift by looking at the result and then do the subtraction operation (as in the example above)?

Edit: Just wanted to say, IMCS has been a joy to use for my projects! Thank you again for building and supporting it!

knizhnik commented 3 years ago

As far I understand in this case you have no just alignment problem: what you actually need is something like ungroup operator. You group timeseries by week and then want to compare it with each day. IMCS has cs_repeat operator which can be used to repeat elements of input sequence requested amount of times. If you group data by week (7 days), then repeating grouped sequence 7 times should restore number of elements in original sequence.

There are two issues which can complicate usage of cs_repeat functions:

  1. origin offset (first element may not correspond to first day of the week)
  2. there are may be some holes in input sequence (holidays,...)

May be it is better to implement special "window" grouping function which will preserve number of elements of original sequence? cs_window_group_min({1,2,3,4,5},{1,1,1,2,2}) => {1,1,1,4,4}

pavitrakumar78 commented 3 years ago

There are two issues which can complicate usage of cs_repeat functions:

  1. origin offset (first element may not correspond to first day of the week)
  2. there are may be some holes in input sequence (holidays,...)

About these 2 issues, I was thinking I could solve the issue (1.) by making sure that the from_date part of *_get(....) function is the start of the timeframe - i.e if it's a week, I can push the date further back until I see a Monday, etc.,

But it seems like I've run into another problem when conducting some experiments with nesting cs_group functions. This might be related to the issue (2.) you mentioned - but I'm not sure yet.

Issue: average applied on cs_group's output differs from the values I see on my trading platforms and from simple pandas grouping.

Example:

select cs_window_avg((cs_group_last((close_price), (cs_week((symbol_date))))), 14)<<(13) from us_eod_get('SPY.US', '2020-01-03', '2020-11-27');

The last few values of the above executions is: {...,336.28, 336.77, 338.55, 340.33, 341.75, 343.57} These are not the correct values.

But if I use the same data with something simple like pandas-ta, where the code is like:

import pandas as pd
import pandas_ta as ta

bars = pd.read_csv(...)
symbol_from_date = '2020-01-03'
symbol_to_date = '2020-11-27'
bars = bars.loc[(bars.date>=symbol_from_date) & (bars.date<=symbol_to_date), :].reset_index(drop=True)
bars['date'] = pd.to_datetime(bars['date'])
logic = {'open'  : 'first',
         'high'  : 'max',
         'low'   : 'min',
         'close' : 'last',
         'volume': 'sum'}
barsw = bars.resample('W').apply(logic)
sma = ta.sma(barsw['close'], 14)

The last few values of the above executions (sma) is: [..., 333.20, 335.32, 337.04, 337.45, 339.13, 340.81, 342.14, 343.86] These values correspond with the values I see on my trading platforms so they are correct.

But what I'm confused about is, the output of the price columns form cs_group_last(close_price, cs_week(symbol_date)) and bars.resample('W').apply(logic) are exactly the same: {..., 333.83, 346.85, 347.29, 345.77, 326.54, 350.16, 358.10, 355.33, 363.67} So shouldn't the moving average result of this grouping also be the same?

For any data that is not grouped in any way, the SMA values match so I don't think there are any issues with the SMA module itself, but is this due to lack of handling the missing dates/holidays issue that you mentioned in your post? SPY.US.csv.zip

knizhnik commented 3 years ago

Sorry, I an not reproduce the problem with the attached CSV file:

select cs_window_avg(cs_group_last(Close, cs_week(Day)),14) from Quote_get('2020-01-03', '2020-11-27');          
 float8:{23.02928597586495,46.2942853655134,70.00500052315849,93.48857116699219,116.4692862374442,140.1978585379464,164.312144688
1975,188.1321454729353,209.2935747419085,230.5407169886998,249.7778603690011,266.120717729841,284.2221461704799,301.9500034877232
,298.7921469552176,296.0014343261719,292.5028621128627,289.2185777936663,287.1264343261719,283.846433367048,280.8350045340402,278
.7521471296038,280.4007175990513,280.8828604561942,283.6914324079241,288.7807170322963,292.9814322335379,297.9385746547154,301.04
71452985491,303.4928588867188,306.6035723005022,310.3021436418806,313.4735717773438,317.2735726492746,321.2121429443359,323.94428
58014788,324.9957144601004,326.884286063058,328.3192858014788,330.7328578404018,333.2057146344866,335.327144077846,337.0457153320
312,337.4500013078962,339.1385737827846,340.8192879813058,342.1400015694754,343.8678588867188}
pavitrakumar78 commented 3 years ago

Can you also please confirm the outputs for the following?

select cs_group_last('int4:{3329970,3332360,3342320,3314730,3360940,3354870,3354970,3365620,3372900,3358850,3369310,3381259,3415520,3427480,3461840,3469409,3491820,3479169,3511940,3562740,3440130,3412040,3318809,3384350,3325590,3327280,3371100,3388140,3374690,3345009,3306500,3269700,3303000,3226400,3235000,3287300,3341900,3323700,3348900,3370400,3338399,3397600,3349300,3407600,3437799,3468500,3524300,3501300,3479300,3475000,3472900,3420100,3433800,3427300,3446100,3457799,3393900,3382200,3266600,3299800,3265400,3302000,3360299,3435400,3502400,3501600,3545600,3540400,3566700,3532100,3581000,3625700,3606200,3562799,3577799,3553300,3574600,3632200,3626600,3636700}', cs_week('date:{2020-08-06,2020-08-07,2020-08-10,2020-08-11,2020-08-12,2020-08-13,2020-08-14,2020-08-17,2020-08-18,2020-08-19,2020-08-20,2020-08-21,2020-08-24,2020-08-25,2020-08-26,2020-08-27,2020-08-28,2020-08-31,2020-09-01,2020-09-02,2020-09-03,2020-09-04,2020-09-08,2020-09-09,2020-09-10,2020-09-11,2020-09-14,2020-09-15,2020-09-16,2020-09-17,2020-09-18,2020-09-21,2020-09-22,2020-09-23,2020-09-24,2020-09-25,2020-09-28,2020-09-29,2020-09-30,2020-10-01,2020-10-02,2020-10-05,2020-10-06,2020-10-07,2020-10-08,2020-10-09,2020-10-12,2020-10-13,2020-10-14,2020-10-15,2020-10-16,2020-10-19,2020-10-20,2020-10-21,2020-10-22,2020-10-23,2020-10-26,2020-10-27,2020-10-28,2020-10-29,2020-10-30,2020-11-02,2020-11-03,2020-11-04,2020-11-05,2020-11-06,2020-11-09,2020-11-10,2020-11-11,2020-11-12,2020-11-13,2020-11-16,2020-11-17,2020-11-18,2020-11-19,2020-11-20,2020-11-23,2020-11-24,2020-11-25,2020-11-27}'))/cs_const(10000,'float8');

My output: float8:{333.236,335.497,338.1259,349.182,341.204,332.728,330.65,328.73,333.8399,346.85,347.29,345.7799,326.54,350.16,358.1,355.33,363.67}

select cs_group_last(close_price, cs_week(symbol_date)) from us_eod_get('SPY.US', '2020-01-03', '2020-11-27');

My output: float8:{317.849,321.1019,327.254,324.119,317.178,327.5,332.824,328.762,292.069,293.252,265.51,226.889,251.303,246.117,275.8759,284.246,280.607,280.428,289.998,283.889,292.972,301.778,316.673,301.669,307.409,298.854,310.985,316.324,320.437,319.6,325.218,333.236,335.497,338.1259,349.182,341.204,332.728,330.65,328.73,333.8399,346.85,347.29,345.7799,326.54,350.16,358.1,355.33,363.67}

Upon closer inspection, the cs_group is not giving me the same result as the panda's Weekly group apply procedure. Only the last 7 or 8 values seem to match but after that, the differences start building up which ends up giving the wrong result for the cs_avg. I'm testing this on a clean imcs+postgres-12 on ubuntu 18.

knizhnik commented 3 years ago

Output of first query is exactly the same as yours. Output of second query is:

float4:{322.41,325.71,331.95,328.77,321.73,332.2,337.6,333.48,296.26,297.46,269.32,228.8,253.42,248.19,278.2,286.64,282.97,282.7
9,292.44,286.28,295.44,304.32,319.34,304.21,308.64,300.05,312.23,317.59,321.72,320.88,326.52,334.57,336.84,339.48,350.58,342.57,3
34.06,330.65,328.73,333.84,346.85,347.29,345.78,326.54,350.16,358.1,355.33,363.67}
pavitrakumar78 commented 3 years ago

Just figured it out. I thought this could not have been possible but alas - the SPY data in my db is not consistent with the CSV data. Tried a few other symbols and they match in values! Not sure how this happened but something that's completely unrelated to IMCS and that I need to fix - sorry about that!


About the original query of this issue, I'll try some experiments based on the ideas you have mentioned in your comment and update or post a comment here about my findings.

I hope it's OK to keep this issue open.

knizhnik commented 3 years ago

I have added cs_wingroup* family of functions. I hope now your task can be solved in more easy way.

pavitrakumar78 commented 3 years ago

I have added cs_wingroup* family of functions. I hope now your task can be solved in more easy way.

Wow! Thank you! This is super useful for multi-timeseries comparison!!! This will help definitely with my task!