knizhnik / imcs

In-Memory Columnar Store extension for PostgreSQL
Apache License 2.0
203 stars 33 forks source link

ERROR: group by sequence doesn't match values sequence #34

Open amutu opened 10 years ago

amutu commented 10 years ago

for same data,if filter is countryid =1 ,the sql report error: ERROR: group by sequence doesn't match values sequence while if the filter is countryid =2 or countryid =3 ,then sql run OK

select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(2::int),y.uin),cs_filter(y.countryid = cs_const(2::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10; ERROR: group by sequence doesn't match values sequence Time: 139.287 ms

select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(1::int),y.uin),cs_filter(y.countryid = cs_const(1::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10; clientversion | cnt_distinct | revision ---------------+--------------+---------- (0 rows)

Time: 67.556 ms

amutu commented 10 years ago

wired, on another server with the same data ,all sql above run ok. only difference is on another server the data is fresh loaded(I restart the server today) while the server which report the error is running since last week and load data every hour from cron job.

amutu commented 10 years ago

some more case:

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '10 hour'::interval,now()::timestamp) as x where x is not null; run OK

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null; run OK

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '5 hour'::interval,now()::timestamp) as x where x is not null; ERROR: group by sequence doesn't match values sequence

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '4 hour'::interval,now()::timestamp) as x where x is not null; ERROR: group by sequence doesn't match values sequence

amutu commented 10 years ago

it looks like memory dirty on this machine:

select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null; NOTICE: IMCS command: const CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN NOTICE: IMCS command: eq NOTICE: IMCS command: filter NOTICE: IMCS command: count NOTICE: IMCS command: const CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN NOTICE: IMCS command: eq NOTICE: IMCS command: filter NOTICE: IMCS command: count c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+------- 27729 | 28334 | 27712 | 28334 | 27712 (1 row)

on another machine ,the result is : select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null; NOTICE: IMCS command: const CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN NOTICE: IMCS command: eq NOTICE: IMCS command: filter NOTICE: IMCS command: count NOTICE: IMCS command: const CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN NOTICE: IMCS command: eq NOTICE: IMCS command: filter NOTICE: IMCS command: count c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+------- 28334 | 28334 | 28334 | 28334 | 28334 (1 row)

Time: 2.476 ms

but in the bad state machine,some query run without problem,such: select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null; run OK

knizhnik commented 10 years ago

Is the problem reproduced if you switch off multithreading query execution: set imcs.n_threads=1 in postgresql.conf

amutu commented 10 years ago

no,restart the pg and reload the data to cs make all sql run ok.

amutu commented 10 years ago

How can I help to debug tihis problem? it happened again.

my usage: I have a partitioned table crashlog and the partition table named crashlog_p_20140101 ,crashlog_p_20140102...

I have a cron job,delete and load data every hour: step 1:delete oldest data to release memroy crashlog_delete(data of n days ago)

step 2:clean data of last hour in case of run this cmd multi times crashlog_delete(last hour till now)

step3:load data of last hour crashlog_load(filter := 'select data from carshlog where data is last hour')

at the same time,user may run sql on the server.

is it casued by some concurrent delete/select/load ? because if user not use(so thiere is not select or read),then the data is ok.I find this because after hole night load data,there is no such error in the morning,but in the afternoon,the error happened.

knizhnik commented 10 years ago

I found one possible synchronization issues in delete function. Please try to reproduce the problem with new version.

Also, do you have imcs_autoload enabled? If yes, please try to disable it (actually I do not want to say that it is impossible to use autoload in concurrent queries, but I want to understand if the problem is related with it or not).

amutu commented 10 years ago

I can repreduce the bug by 2 script concurrent running in 3 sessions: https://github.com/amutu/data/tree/master

while the data is constant insert into crashlog table,

open first session: while true do ./delete_load.sh >> ./delete_load.log done

open second session: while true do ./delete_load.sh >> ./delete_load2.log done

open third session: while true do ./check_cs_data.sh >> ./check_cs_data.log done

open forth session: tail -f ./check_cs_data.log

for a wile,you will see .....n data error found! where n >= 1

note:you can create wx_version as: create table wx_version as select clientversion from crashlog;

I see the code: /* This function is called in context protected by imcs->lock _/ void imcs_free_page(imcs_paget page) { imcs_free_page_t* pg = (imcs_free_page_t*)page; pg->next = imcs->free_pages; imcs->free_pages = pg; imcs->n_used_pages -= 1; }

but in imcs_delete_page,I can't find imcs->lock around imcs_free_page,is it relative?

knizhnik commented 10 years ago

Did you try the new version (61)?

amutu commented 10 years ago

I just try the new version,unfortunately,it still can be reproduced by the script above.

knizhnik commented 10 years ago

Sorry, but what was the actual error message? Did you inspect the logs? Unfortunately this scenario is not correct (this is why I didn't try to reproduce it). IMCS requires that data is always inserted in timestamp increasing order. But you concurrently run two delete_load.sh each of which tries to load is own data range based on system clock values! So first script may try to load [100..200] and second [110...210]. What will happen in second script will be able to execute it's load command first?

amutu commented 10 years ago

on the pgsql console,I get error some time(may happen,then disappear,then happen...): ERROR: group by sequence doesn't match values sequence when run sql: select cs_hash_dup_count(x.uin ,x.class1) from wx_version,crashlog_get(clientversion) as x where x is not null;

on the console: tail -f ./check_cs_data.log I get ".....n data error found!"

and sure, the load session get error aoubt the out of order.but this sould not make the cs length miss match

knizhnik commented 10 years ago

Actually I do not understand what makes psql return error code. If your query returns nothing (empty result set_, then psql still returns normal error code. It will return non-zero error code only if some error happen during query execution. And I suspect that such error will be "value out of timeseries order"! But please notice that this error is detected only for timestamp timeseries. And values are now appended in order of columns in the table. In your case: uin | clientversion | revision | crashtime ... It means that before this error was reported for crashtime (or what else you are using as timestamp), some other values from this columns will be successfully added to columnar store. And it will cause timeseries length mismatch. IMCS is not transactional storage - it is not able to undo partly completed operation. The obvious workarround is to populate first timeseries column. I committed this patch in revision 62.

amutu commented 10 years ago

Now,I can reproduce this bug in this branch : 64fb76fbfb68285c35802b33cdf3da99a293fb95

this is the step: crash=> select crashlog_truncate();

crashlog_truncate

(1 row)

crash=> select crashlog_load(filter:= 'logtime >= $$2014-04-14$$ and logtime < $$2014-04-14 01:00:00$$');

crashlog_load

    902405

(1 row)

---------------fresh data,no miss match cs-------------------- crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;

count

 0

(1 row)

--------------------delete has bug------------- crash=> select crashlog_delete(clientversion,'2014-04-14 00:14:33','2014-04-14 00:18:01') from wx_version;

crashlog_delete

           1
           0
           1

......
0 0 2 0 11462 1

(240 rows)

--------------cs data length miss match------------------ crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;

count

34

(1 row)