Open amutu opened 10 years ago
wired, on another server with the same data ,all sql above run ok. only difference is on another server the data is fresh loaded(I restart the server today) while the server which report the error is running since last week and load data every hour from cron job.
some more case:
select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '10 hour'::interval,now()::timestamp) as x where x is not null; run OK
select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null; run OK
select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '5 hour'::interval,now()::timestamp) as x where x is not null; ERROR: group by sequence doesn't match values sequence
select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '4 hour'::interval,now()::timestamp) as x where x is not null; ERROR: group by sequence doesn't match values sequence
it looks like memory dirty on this machine:
select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null;
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+-------
27729 | 28334 | 27712 | 28334 | 27712
(1 row)
on another machine ,the result is :
select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null;
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+-------
28334 | 28334 | 28334 | 28334 | 28334
(1 row)
Time: 2.476 ms
but in the bad state machine,some query run without problem,such: select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null; run OK
Is the problem reproduced if you switch off multithreading query execution: set imcs.n_threads=1 in postgresql.conf
no,restart the pg and reload the data to cs make all sql run ok.
How can I help to debug tihis problem? it happened again.
my usage: I have a partitioned table crashlog and the partition table named crashlog_p_20140101 ,crashlog_p_20140102...
I have a cron job,delete and load data every hour: step 1:delete oldest data to release memroy crashlog_delete(data of n days ago)
step 2:clean data of last hour in case of run this cmd multi times crashlog_delete(last hour till now)
step3:load data of last hour crashlog_load(filter := 'select data from carshlog where data is last hour')
at the same time,user may run sql on the server.
is it casued by some concurrent delete/select/load ? because if user not use(so thiere is not select or read),then the data is ok.I find this because after hole night load data,there is no such error in the morning,but in the afternoon,the error happened.
I found one possible synchronization issues in delete function. Please try to reproduce the problem with new version.
Also, do you have imcs_autoload enabled? If yes, please try to disable it (actually I do not want to say that it is impossible to use autoload in concurrent queries, but I want to understand if the problem is related with it or not).
I can repreduce the bug by 2 script concurrent running in 3 sessions: https://github.com/amutu/data/tree/master
while the data is constant insert into crashlog table,
open first session: while true do ./delete_load.sh >> ./delete_load.log done
open second session: while true do ./delete_load.sh >> ./delete_load2.log done
open third session: while true do ./check_cs_data.sh >> ./check_cs_data.log done
open forth session: tail -f ./check_cs_data.log
for a wile,you will see .....n data error found! where n >= 1
note:you can create wx_version as: create table wx_version as select clientversion from crashlog;
I see the code: /* This function is called in context protected by imcs->lock _/ void imcs_free_page(imcs_paget page) { imcs_free_page_t* pg = (imcs_free_page_t*)page; pg->next = imcs->free_pages; imcs->free_pages = pg; imcs->n_used_pages -= 1; }
but in imcs_delete_page,I can't find imcs->lock around imcs_free_page,is it relative?
Did you try the new version (61)?
I just try the new version,unfortunately,it still can be reproduced by the script above.
Sorry, but what was the actual error message? Did you inspect the logs? Unfortunately this scenario is not correct (this is why I didn't try to reproduce it). IMCS requires that data is always inserted in timestamp increasing order. But you concurrently run two delete_load.sh each of which tries to load is own data range based on system clock values! So first script may try to load [100..200] and second [110...210]. What will happen in second script will be able to execute it's load command first?
on the pgsql console,I get error some time(may happen,then disappear,then happen...): ERROR: group by sequence doesn't match values sequence when run sql: select cs_hash_dup_count(x.uin ,x.class1) from wx_version,crashlog_get(clientversion) as x where x is not null;
on the console: tail -f ./check_cs_data.log I get ".....n data error found!"
and sure, the load session get error aoubt the out of order.but this sould not make the cs length miss match
Actually I do not understand what makes psql return error code. If your query returns nothing (empty result set_, then psql still returns normal error code. It will return non-zero error code only if some error happen during query execution. And I suspect that such error will be "value out of timeseries order"! But please notice that this error is detected only for timestamp timeseries. And values are now appended in order of columns in the table. In your case: uin | clientversion | revision | crashtime ... It means that before this error was reported for crashtime (or what else you are using as timestamp), some other values from this columns will be successfully added to columnar store. And it will cause timeseries length mismatch. IMCS is not transactional storage - it is not able to undo partly completed operation. The obvious workarround is to populate first timeseries column. I committed this patch in revision 62.
Now,I can reproduce this bug in this branch : 64fb76fbfb68285c35802b33cdf3da99a293fb95
this is the step: crash=> select crashlog_truncate();
(1 row)
crash=> select crashlog_load(filter:= 'logtime >= $$2014-04-14$$ and logtime < $$2014-04-14 01:00:00$$');
902405
(1 row)
---------------fresh data,no miss match cs-------------------- crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;
0
(1 row)
--------------------delete has bug------------- crash=> select crashlog_delete(clientversion,'2014-04-14 00:14:33','2014-04-14 00:18:01') from wx_version;
1
0
1
......
0
0
2
0
11462
1
(240 rows)
--------------cs data length miss match------------------ crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;
34
(1 row)
for same data,if filter is countryid =1 ,the sql report error: ERROR: group by sequence doesn't match values sequence while if the filter is countryid =2 or countryid =3 ,then sql run OK
select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(2::int),y.uin),cs_filter(y.countryid = cs_const(2::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10; ERROR: group by sequence doesn't match values sequence Time: 139.287 ms
select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(1::int),y.uin),cs_filter(y.countryid = cs_const(1::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10; clientversion | cnt_distinct | revision ---------------+--------------+---------- (0 rows)
Time: 67.556 ms