knizhnik / imcs

In-Memory Columnar Store extension for PostgreSQL
Apache License 2.0
201 stars 33 forks source link

[Q] Window aggregate partitioned by symbol and complex conditions #54

Closed pavitrakumar78 closed 4 years ago

pavitrakumar78 commented 4 years ago

Hi,

I have been experimenting with IMCS for about a day now and I'm very impressed with the speed and the flexibility available when writing queries with this extension! Thanks for creating this, @knizhnik !

I've been trying to emulate some queries I've written in pure PGSQL in IMCS but since this is my first time working with columnar db's I'm not entirely sure if I'm I have the right idea about how things work here. So I have questions I wanted to ask:

(For all the below examples, I'm assuming a schema exactly the same as that of Quote in the introduction docs here.

For example, say, if I wanted to know all the stocks which had 14-day simple moving average (SMA) over its close price between a particular period, in SQL that'd be:

select distinct(Symbol)
from(
    select Symbol, Day, Close, 
        avg(Close) over (partition by Symbol order by Day rows 14-1 preceding ) as sma14
    from Quote
    where Day between '2020-07-05' and '2020-08-05'
) tb1
where sma14 > close_price;

Using IMCS, I know I can get 14-day SMA for all stocks using the following query (taken from examples in performance comparison section):

select Symbol, cs_window_avg(Close, 14) as sma14 from SecurityQuotes;

But I'm not sure how I can get the filtering I've done in the SQL in IMCS - if I have to use cs_filter_pos, how would I use it? or does this type of filtering require a different type of table?

Also, I'm not sure if this is the right place to ask these questions but I did not find any discussion forums for IMCS (despite how big of an extension it is) so if I need to take this to an appropriate place, please do let me know where.

knizhnik commented 4 years ago

To implement the following SQL query in IMCS:

select distinct(Symbol)
from(
    select Symbol, Day, Close, 
        avg(Close) over (partition by Symbol order by Day rows 14-1 preceding ) as sma14
    from Quote
    where Day between '2020-07-05' and '2020-08-05'
) tb1
where sma14 > close_price;

you should run the following query:

select Symbol from (select Symbol, (select cs_count(cs_filter_pos(cs_window_avg(Close, 14)>Close) cnt from  Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)) s1 from SecurityQuotes) s2 where cnt>0;
pavitrakumar78 commented 4 years ago
select Symbol from (select Symbol, (select cs_count(cs_filter_pos(cs_window_avg(Close, 14)>Close) cnt from  Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)) s1 from SecurityQuotes) s2 where cnt>0;

There's a bracket missing - I think it's supposed to come before cnt?

select
    Symbol 
from
    (
        select
            Symbol,
            (
                select
                    cs_count(cs_filter_pos(cs_window_avg(Close, 14) > Close)) cnt 
                from
                    Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)
            )
            s1 
        from
            SecurityQuotes
    )
    s2 
where
    cnt > 0;

With this, I get error: ERROR: column "cnt" does not exist

knizhnik commented 4 years ago

Sorry,

select Symbol from (select Symbol, (select cs_count(cs_filter_pos(cs_window_avg(Close, 14)>Close) from  Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)) cnt from SecurityQuotes) s2 where cnt>0;
pavitrakumar78 commented 4 years ago
select Symbol from (select Symbol, (select cs_count(cs_filter_pos(cs_window_avg(Close, 14)>Close) from  Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)) cnt from SecurityQuotes) s2 where cnt>0;

Thank you for the correction. A bracket is still missing, but this is the final query, correct?

select
    Symbol 
from
    (
        select
            Symbol,
            (
                select
                    cs_count(cs_filter_pos(cs_window_avg(Close, 14) > Close)**)** <-- you last query was missing this bracket
                from
                    Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)
            )
            cnt 
        from
            SecurityQuotes
    )
    s2 
where
    cnt > 0;

I tried this but got segfault in the log and the db connection was closed unexpectedly. Here's the log:

2020-09-10 16:39:51.396 UTC [144610] LOG:  server process (PID 183472) was terminated by signal 11: Segmentation fault                 
2020-09-10 16:39:51.396 UTC [144610] DETAIL:  Failed process was running: select Symbol from (select Symbol, (select cs_count(cs_
filter_pos(cs_window_avg(Close, 14)>Close)) from  Quote_get(Symbol, '2020-07-05'::date, '2020-08-05' ::date)) 
cnt from SecuritiesQuotes)                                                                                                            
        s2 where cnt>0;                                                                                      
2020-09-10 16:39:51.396 UTC [144610] LOG:  terminating any other active server processes                                               
2020-09-10 16:39:51.397 UTC [169204] WARNING:  terminating connection because of crash of another server process                       
2020-09-10 16:39:51.397 UTC [169204] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and
 exit, because another server process exited abnormally and possibly corrupted shared memory.                                          
2020-09-10 16:39:51.397 UTC [169204] HINT:  In a moment you should be able to reconnect to the database and repeat your command.       
2020-09-10 16:39:51.397 UTC [227781] postgres@stock_db FATAL:  the database system is in recovery mode                                 
2020-09-10 16:39:51.399 UTC [144610] LOG:  all server processes terminated; reinitializing                                             
2020-09-10 16:39:51.782 UTC [227782] LOG:  database system was interrupted; last known up at 2020-09-10 16:03:53 UTC                   
2020-09-10 16:39:51.890 UTC [227782] LOG:  database system was not properly shut down; automatic recovery in progress                  
2020-09-10 16:39:51.892 UTC [227782] LOG:  redo starts at 5/6F70AA40                                                                   
2020-09-10 16:39:51.892 UTC [227782] LOG:  invalid record length at 5/6F70AB28: wanted 24, got 0                                       
2020-09-10 16:39:51.892 UTC [227782] LOG:  redo done at 5/6F70AAF0                                                                     
2020-09-10 16:39:51.906 UTC [144610] LOG:  database system is ready to accept connections  

Info about the table: Same schema as Quote table in the introduction docs here. ~ 55 mil rows/ ~19,000 stocks SELECT pg_size_pretty (pg_relation_size('Quote')) as tbl_size, pg_size_pretty (pg_indexes_size('Quote')) idx_size; returns: 4978 MB for table and 7176 MB for the index ( index on date, id, (date, id) )

System config: 8x 2.0 GHz dedicated CPU instance - 16 GB RAM, 320 GB SSD No other processes were running when this query was run.

I'm able to execute the queries mentioned in the intro page in secs/msecs without any issues. For example,

select cs_count(cs_filter_pos(cs_window_avg(Close, 14)>Close)) from  Quote_get('AAPL.US', '2020-07-05'::date, '2020-08-05' ::date);

Takes <0.5ms for the first run (measured using \timing).

The VWAP calculation for all the stocks (as mentioned in the intro) after creating the view takes ~5 secs to finish computing in this instance, but I'm not sure why the query for SMA>open for all stocks does not work. Is there anything else I can look at to figure out where the issue originates from? I should be able to load this entire table into memory with 16GB right?

Edit:

Here's what I get from gdb:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007f68418a37b2 in imcs_subseq_page.constprop () from /usr/lib/postgresql/12/lib/imcs.so

I've also attached my postgres.conf file with this post. I've not done many changes - the important changes are shared_buffers = 2GB, work_mem = 20MB and maintenance_work_mem = 1GB. postgresql.txt

knizhnik commented 4 years ago

Sorry, there was bug in handling empty subsequences, fixed.

knizhnik commented 4 years ago

I do not know which data you have imported in Quotes table, but NYSE_2003_2013 as it is clear from its name s contains dates from 2003 till 2013.

pavitrakumar78 commented 4 years ago

I do not know which data you have imported in Quotes table, but NYSE_2003_2013 as it is clear from its name s contains dates from 2003 till 2013.

Thank you for the fix.

I'm using different data set to test the queries. It's much bigger than NYSE dataset and has data from 1990s to 2020. I'm not near my work PC now - I'll report back in a few hrs after updating my local installation.

Is there a proper way to uninstall (completely remove) the extension fully and then install it fresh?

knizhnik commented 4 years ago

drop extension imcs + server restart should do it

pavitrakumar78 commented 4 years ago

Sorry, there was bug in handling empty subsequences, fixed.

I can confirm it works now! Thanks!

I have one more question: Some of my queries involve working with multiple indicators with LAG functions and with different periods. I don't need to explicitly handle the alignment of rows in PGSQL because the UDFs I've created return null for the first n values where n is period and functions like LAG also do the same. Also, arithmetic operations on these columns also don't require any special handling because any op with null is just null and thus the row sizes stay the same. But in IMCS, since we are dealing with arrays, this becomes a bit difficult to handle:

for example, in PG12, I can do:

select x, 
lag(x,3) over (order by x), 
(x-lag(x,3) over (order by x)) as result
from generate_series(0,10) as g(x);

which gives all 3 as result for the result column which is after exactly 3 null values i.e, doing computation with columns that do not have shift with ones that have null values is easy since I don't need to perform shift twice to align the column values. But in IMCS, I need to do the following to get the same result:

select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') << 3) - (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') >> 3);

It's simple for this case, but when you have nested indicators and do logical operations with them, there's a lot of room for error since I can't fall back on how the engine handles columns with null values as in native PG.

Is there a way to simplify this in IMCS or do I just have to deal with handling each array with the specific cs_cut_head or cs_cut_tail function?

knizhnik commented 4 years ago

Sorry, can you explain which IMCS query you want to perform I what is the expected result? Do you want something like this:

select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') << 3) - cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4');
pavitrakumar78 commented 4 years ago

Sorry, can you explain which IMCS query you want to perform I what is the expected result? Do you want something like this:

select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') << 3) - cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4');

I want to be able to use computed columns with lag along with computed columns without any lag/null values without having to deal with aligning the array as it becomes cumbersome when dealing with many computed columns with lags and shifts.

For example, executing

select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4')) - (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') >> 3);

would be the direct equivalent of PGSQL's query below in terms of syntax.

select x-lag(x,3) over (order by x) as result
from generate_series(0,10) as g(x);

but the IMCS query gives the wrong result compared to PGSQL's because the actual operation done in IMCS is : {0,1,2,3,4,5,6,7,8,9,10} (8,9,10 is ignored due to length mismatch) {0,1,2,3,4,5,6,7} -


= {0,0,0,0,0,0,0} (wrong compared to PGSQL's equivalent)

It's because the array is 'aligned' from left to right.

But in PGSQL, what happens after the shift is: {0,1,2,3,4,5,6,7,8,9,10} {X,X,X,0,1,2,3,4,5,6,7} -


= {3,3,3,3,3,3,3}

So to emulate the same behavior in IMCS, I have to carefully use shifts everywhere to make sure all the arrays are aligned else, I get the wrong result for a similar query to PGSQL. My question: Is there a simpler way of doing away with handling extra shifts in IMCS? Is it possible to use the same syntax above (only 1 shift as in PGSQL) to get the result as PGSQL? i.e, an easier way such that IMCS automatically handles shifts by appending NULLs at the beginning of the array and behaves like PGSQL.

pavitrakumar78 commented 4 years ago

Sorry for the double comment, but as another example, see below on how I calculate ATR in plain PGSQL:

select Symbol, Date,
avg(true_range, 14) over (partition by Symbol order by Date rows 14-1 preceding) as atr14
from(
select *,
GREATEST(
    High - lag(Close, 1) over (partition by Symbol order by Date), 
    lag(Close, 1) over (partition by Symbol order by Date) - Low, 
    High - Low
    ) as true_range
from Quotes
where Symbol = 'ABB' and Date >= '2010-01-01' and Date <= '2010-01-30'
) tb1

Looks simple and I've verified that the results are the same as the indicator values on ThinkOrSwim trading client (ATR using SMA).

But in IMCS, here's what I have to do to get the same result:

select Date, cs_window_avg(
cs_maxof(cs_maxof((High<<1) - (Close>>1), (Close>>1) - (Low<<1)), (High) - (Low)), 14)<<(14-1) as atr
from Quote_get('ABB', date('01-Jan-2010'), date('30-Jan-2010'));

I'm not using the built-in ATR to just show an example of how many shifts I would need to write to get the same result as the PGSQL one. Both these queries give the same output for me. But in the IMCS query, I have to write plenty of shifts to align the arrays in the right positions which I don't need to do in PGSQL. This approach of adjusting/shifting arrays becomes more error-prone as we include more indicators and columns with lags in the query.

It would be very useful and much easier to use IMCS for building complex indicators with shifts if the array operations automatically took care of prepending NULLs wherever necessary to keep the size constant. I hope this gives you a better idea of my request/question.

knizhnik commented 4 years ago

Sorry, in case of standard SQL result of select x-lag(x,3) over (order by x) as result from generate_series(0,10) as g(x); is {null,null,null,3,3,3,3,3,3,3,3} But IMCS doesn't support NULL values! This is why it is necessary to "align" arrays.

knizhnik commented 4 years ago

If you shift just one of sequence left, then you will get the same result, only NULL values are skept. postgres=# select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') << 3) - cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4'); ?column?

int4:{3,3,3,3,3,3,3,3} (1 row)

pavitrakumar78 commented 4 years ago

If you shift just one of sequence left, then you will get the same result, only NULL values are skept.

postgres=# select (cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4') << 3) - cs_parse('{0,1,2,3,4,5,6,7,8,9,10}', 'int4'); ?column? int4:{3,3,3,3,3,3,3,3} (1 row)

Yes. This is the other idea I had - basically, shift the column that does NOT have the actual lag parameter - i.e., if lag(x)-y in pgsql, in IMCS, you do x-lag(y). Thanks for the suggestion! I will try to work with this.

One last question: Is it normal for an IMCS query (a simple one) to take >2.5 hrs (still running as of this comment) on a db size of ~60mil rows after restarting database/entire instance? (My setup: 860 EVO SSD + 16 GB RAM + i5 3.4 GHz)

I've not messed with the settings file, so imcs.autoload is still true. I understand that IMCS needs to load data into memory, but the weird thing is that I can't see any increase in memory in the process manager that indicates that IMCS is loading something in the background. Unfortunately, I did have to turn off my work PC yesterday so had to reboot server (which I understand you do not recommend doing).

Could there be any other reason as to why it's taking so long? The entire db size itself is only ~5-6 GB.

knizhnik commented 4 years ago

Yes, most likely it is loading data. The size of memory is not increased, because it is part of shared memory preallocated at server start. You can check if there is disk activity using vmstat or iotop utilities.

pavitrakumar78 commented 4 years ago

Yes, most likely it is loading data. The size of memory is not increased, because it is part of shared memory preallocated at server start. You can check if there is disk activity using vmstat or iotop utilities.

Had to restart my server again - twice and it works now... weird... Maybe something to do with Ubuntu WSL (still not perfect). Takes ~5-10 mins to load after executing select Quote_load(); after a db restart.

Thanks for all your answers! :) I'll close this issue now!