powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing
Other
272 stars 26 forks source link

PostgreSQL 11 compabitility : widen query IDs from 32 bits to 64 bits. #17

Closed anayrat closed 6 years ago

anayrat commented 6 years ago

Since cff440d368, queryid becomes a uint64 in pg_stat_statements.

SQL functions no longer use oid type but bigint for queryid attribute.

Should fix #16

rjuju commented 6 years ago

Thanks for the PR! The patch looks ok. Could you resolve the conflict on the changelog? Also, I see quite a lot of unrelated changes (see definition of pgqsHashKey and pgqsWalkerContext). I'd prever to revert those and have a pgident run afterwards.

rjuju commented 6 years ago

Sorry I'm not used to the review thing, so I clearly missed you did the change. Just for safety, did you run a postgres make installcheck with pgqs enabled for both pg<11 and pg11?

anayrat commented 6 years ago

Hi, I run make installcheck on all supported branches *_STABLE and master. All checks were ok, but I also run make installcheck-world and found this issue on master (and only master) :

============== running regression test queries        ==============
test postgres_fdw                 ... FAILED (test process exited with exit code 2)

It is hard to reproduce, sometimes the test pass.

If I run this query several times I got a segfault :

SELECT t1."C 1", t2.c1, t3.c1 FROM "S 1"."T 1" t1 full join ft1 t2 full join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1") OFFSET 100 LIMIT 10;

Here is the backtrace :

#0  0x000000000070053c in list_nth_cell (list=0x1639838, n=64995) at list.c:399
#1  0x0000000000700577 in list_nth (list=0x1639838, n=65001) at list.c:413
#2  0x00007f0c3f6e5786 in exprRepr (expr=0x163a5b8, buffer=0x15fc730, context=0x15fc240, include_const=true) at pg_qualstats.c:2090
#3  0x00007f0c3f6e5708 in exprRepr (expr=0x163a148, buffer=0x15fc730, context=0x15fc240, include_const=true) at pg_qualstats.c:2079
#4  0x00007f0c3f6e575b in exprRepr (expr=0x163a088, buffer=0x15fc730, context=0x15fc240, include_const=true) at pg_qualstats.c:2084
#5  0x00007f0c3f6e55e0 in hashExpr (expr=0x163a088, context=0x15fc240, include_const=true) at pg_qualstats.c:2058
#6  0x00007f0c3f6e396c in pgqs_process_opexpr (expr=0x163a088, context=0x15fc240) at pg_qualstats.c:1330
#7  0x00007f0c3f6e4076 in pgqs_whereclause_tree_walker (node=0x163a088, context=0x15fc240) at pg_qualstats.c:1526
#8  0x00000000006fb53d in expression_tree_walker (node=0x163a208, walker=0x7f0c3f6e3efe <pgqs_whereclause_tree_walker>, context=0x15fc240) at nodeFuncs.c:2119
#9  0x00007f0c3f6e2d3e in pgqs_collectNodeStats (planstate=0x1643e10, ancestors=0x0, context=0x15fc240) at pg_qualstats.c:967
#10 0x00007f0c3f6e2de9 in pgqs_collectNodeStats (planstate=0x1643c20, ancestors=0x0, context=0x15fc240) at pg_qualstats.c:982
#11 0x00007f0c3f6e250c in pgqs_ExecutorEnd (queryDesc=0x1582d10) at pg_qualstats.c:716
#12 0x0000000000695dc3 in ExecutorEnd (queryDesc=0x1582d10) at execMain.c:464
#13 0x00000000006348d7 in PortalCleanup (portal=0x15c0d00) at portalcmds.c:301
#14 0x00000000009e1054 in PortalDrop (portal=0x15c0d00, isTopCommit=false) at portalmem.c:496
#15 0x000000000084b98c in exec_simple_query (
    query_string=0x1558990 "SELECT t1.\"C 1\", t2.c1, t3.c1 FROM \"S 1\".\"T 1\" t1 full join ft1 t2 full join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1.\"C 1\") OFFSET 100 LIMIT 10;") at postgres.c:1131
#16 0x000000000084fa7b in PostgresMain (argc=1, argv=0x1585340, dbname=0x1585200 "contrib_regression", username=0x15851e8 "anayrat") at postgres.c:4149
#17 0x00000000007bee1d in BackendRun (port=0x157ea30) at postmaster.c:4409
#18 0x00000000007be588 in BackendStartup (port=0x157ea30) at postmaster.c:4081
#19 0x00000000007bacab in ServerLoop () at postmaster.c:1754
#20 0x00000000007ba31c in PostmasterMain (argc=11, argv=0x1552d30) at postmaster.c:1362
#21 0x00000000006f8251 in main (argc=11, argv=0x1552d30) at main.c:228

It will be hard to investigate, I dig into several core dump and segfaut doesn't happen on the same query:

2018-03-31 14:03:14.277 CEST [11164] LOG:  server process (PID 14725) was terminated by signal 11: Segmentation fault
2018-03-31 14:03:14.277 CEST [11164] DETAIL:  Failed process was running: EXPLAIN (verbose, costs off)
    UPDATE ft2 SET c3 = 'baz'
--
2018-03-31 14:04:41.872 CEST [11164] LOG:  server process (PID 14912) was terminated by signal 11: Segmentation fault
2018-03-31 14:04:41.872 CEST [11164] DETAIL:  Failed process was running: EXPLAIN (VERBOSE, COSTS OFF)
    SELECT t1.c1, t2.c2, t1.c3 FROM ft1 t1 FULL JOIN ft2 t2 ON (t1.c1 = t2.c1) WHERE postgres_fdw_abs(t1.c1) > 0 OFFSET 10 LIMIT 10;
--
2018-03-31 14:05:46.504 CEST [11164] LOG:  server process (PID 15028) was terminated by signal 11: Segmentation fault
2018-03-31 14:05:46.504 CEST [11164] DETAIL:  Failed process was running: SELECT t1."C 1", t2.c1, t3.c1 FROM "S 1"."T 1" t1 left join ft1 t2 full join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1") OFFSET 100 LIMIT 10;
2018-03-31 14:05:46.504 CEST [11164] LOG:  terminating any other active server processes
--
2018-03-31 14:06:57.693 CEST [11164] LOG:  server process (PID 15335) was terminated by signal 11: Segmentation fault
2018-03-31 14:06:57.693 CEST [11164] DETAIL:  Failed process was running: EXPLAIN (VERBOSE, COSTS OFF)
    SELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 BETWEEN 10 AND 15;
--
2018-03-31 14:09:35.664 CEST [15940] LOG:  server process (PID 16011) was terminated by signal 11: Segmentation fault
2018-03-31 14:09:35.664 CEST [15940] DETAIL:  Failed process was running: SELECT t1."C 1", t2.c1, t3.c1 FROM "S 1"."T 1" t1 full join ft1 t2 full join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1") OFFSET 100 LIMIT 10;
2018-03-31 14:09:35.664 CEST [15940] LOG:  terminating any other active server processes
--
2018-03-31 14:11:41.318 CEST [15940] LOG:  server process (PID 16107) was terminated by signal 11: Segmentation fault
2018-03-31 14:11:41.318 CEST [15940] DETAIL:  Failed process was running: SELECT t1."C 1", t2.c1, t3.c1 FROM "S 1"."T 1" t1 full join ft1 t2 full join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1") OFFSET 100 LIMIT 10;
2018-03-31 14:11:41.318 CEST [15940] LOG:  terminating any other active server processes
--
2018-03-31 14:23:05.459 CEST [15940] LOG:  server process (PID 17325) was terminated by signal 11: Segmentation fault
2018-03-31 14:23:05.459 CEST [15940] DETAIL:  Failed process was running: EXPLAIN (VERBOSE, COSTS OFF)
    SELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 BETWEEN 10 AND 15;

I tried the same test several times with p_q_s disabled in s_p_l and all tests passed.

I tried again with 10_STABLE with p_q_s enabled and I reproduced the issue :

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006e0eef in list_nth_cell (list=0x26883b8, n=64994) at list.c:399
399     for (match = list->head; n-- > 0; match = match->next)
(gdb) bt
#0  0x00000000006e0eef in list_nth_cell (list=0x26883b8, n=64994) at list.c:399
#1  0x00000000006e0f2a in list_nth (list=0x26883b8, n=65001) at list.c:413
#2  0x00007f2d7ac8d704 in exprRepr (expr=0x26898a8, buffer=0x2695168, context=0x2694828, include_const=1 '\001') at pg_qualstats.c:2090
#3  0x00007f2d7ac8d686 in exprRepr (expr=0x2688c98, buffer=0x2695168, context=0x2694828, include_const=1 '\001') at pg_qualstats.c:2079
#4  0x00007f2d7ac8d6d9 in exprRepr (expr=0x2688bd8, buffer=0x2695168, context=0x2694828, include_const=1 '\001') at pg_qualstats.c:2084
#5  0x00007f2d7ac8d55e in hashExpr (expr=0x2688bd8, context=0x2694828, include_const=1 '\001') at pg_qualstats.c:2058
#6  0x00007f2d7ac8b8e8 in pgqs_process_opexpr (expr=0x2688bd8, context=0x2694828) at pg_qualstats.c:1330
#7  0x00007f2d7ac8c004 in pgqs_whereclause_tree_walker (node=0x2688bd8, context=0x2694828) at pg_qualstats.c:1526
#8  0x00000000006dbf30 in expression_tree_walker (node=0x2688d58, walker=0x7f2d7ac8be8c <pgqs_whereclause_tree_walker>, context=0x2694828) at nodeFuncs.c:2120
#9  0x00007f2d7ac8ad07 in pgqs_collectNodeStats (planstate=0x268aaf8, ancestors=0x0, context=0x2694828) at pg_qualstats.c:967
#10 0x00007f2d7ac8a4db in pgqs_ExecutorEnd (queryDesc=0x2689c48) at pg_qualstats.c:716
#11 0x000000000067ef27 in ExecutorEnd (queryDesc=0x2689c48) at execMain.c:464
#12 0x00000000005fe5b9 in ExplainOnePlan (plannedstmt=0x26887e8, into=0x0, es=0x260c440, 
    queryString=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "..., params=0x0, queryEnv=0x0, planduration=0x7ffd10622820) at explain.c:574
#13 0x00000000005fe119 in ExplainOneQuery (query=0x264ffe8, cursorOptions=256, into=0x0, es=0x260c440, 
    queryString=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "..., params=0x0, queryEnv=0x0) at explain.c:373
#14 0x00000000005fddcb in ExplainQuery (pstate=0x260c330, stmt=0x262d278, 
    queryString=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "..., params=0x0, queryEnv=0x0, dest=0x260c2a0) at explain.c:256
#15 0x0000000000827ef4 in standard_ProcessUtility (pstmt=0x26737b0, 
    queryString=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x260c2a0, completionTag=0x7ffd10622b00 "") at utility.c:680
#16 0x0000000000827729 in ProcessUtility (pstmt=0x26737b0, 
    queryString=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x260c2a0, completionTag=0x7ffd10622b00 "") at utility.c:357
#17 0x0000000000826939 in PortalRunUtility (portal=0x2560b18, pstmt=0x26737b0, isTopLevel=1 '\001', setHoldSnapshot=1 '\001', dest=0x260c2a0, completionTag=0x7ffd10622b00 "") at pquery.c:1178
#18 0x0000000000826656 in FillPortalStore (portal=0x2560b18, isTopLevel=1 '\001') at pquery.c:1038
#19 0x0000000000826001 in PortalRun (portal=0x2560b18, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x2673890, altdest=0x2673890, completionTag=0x7ffd10622cf0 "") at pquery.c:768
#20 0x000000000082050b in exec_simple_query (
    query_string=0x25c9c68 "EXPLAIN (VERBOSE, COSTS OFF)\nSELECT ft4.c1, q.* FROM ft4 LEFT JOIN (SELECT 13, ft1.c1, ft2.c1 FROM ft1 RIGHT JOIN ft2 ON (ft1.c1 = ft2.c1) WHERE ft1.c1 = 12) q(a, b, c) ON (ft4.c1 = q.b) WHERE ft4.c1 "...) at postgres.c:1099
#21 0x0000000000824568 in PostgresMain (argc=1, argv=0x2570598, dbname=0x2570350 "contrib_regression", username=0x2570338 "anayrat") at postgres.c:4088
#22 0x0000000000796bdd in BackendRun (port=0x256cd20) at postmaster.c:4405
#23 0x0000000000796363 in BackendStartup (port=0x256cd20) at postmaster.c:4077
#24 0x0000000000792ac4 in ServerLoop () at postmaster.c:1755
#25 0x0000000000792141 in PostmasterMain (argc=11, argv=0x2540bd0) at postmaster.c:1363
#26 0x00000000006d8c51 in main (argc=11, argv=0x2540bd0) at main.c:228
rjuju commented 6 years ago

Hi @anayrat, @marco44 reported me the same bug few days ago. I guess it didn't reliably crashed due to pgqs.sample_rate sometimes skipping problematic queries.

I just pushed a fix for this issue. Can you confirm that it also fix the problem for you?

anayrat commented 6 years ago

Hello, Your fix solve the issue :)

rjuju commented 6 years ago

Coming back on this PR, sorry again for late answer. The code seems ok, but it has to be updated since it'll be part of v1.0.5. Do you want to do it as part if this PR or should I push a patch? I'd prefer a separate commit for the filename change, to keep the actual pg 11 compatibility part small and easier.

anayrat commented 6 years ago

@rjuju : I separate this PR in two commit, one for preparation for 1.0.5, one for PG11 compatibility. Should be fine now?

rjuju commented 6 years ago

Looks perfect, thanks!