powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing
Other
272 stars 26 forks source link

crash on PG 9.6.5 make installcheck #15

Closed anse1 closed 6 years ago

anse1 commented 6 years ago

Hi,

running the PostgreSQL 9.6.5 regression tests on a database with powa + pg_qualstats installed has a 30% chance of crashing for me while running the object_address test. Backtrace below.

regards, andreas

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  pgqs_collectNodeStats (planstate=0x557e4d260cd8, context=context@entry=0x557e4d26f368, ancestors=0x0) at pg_qualstats.c:938
    938     total_filtered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    (gdb) bt
    #0  pgqs_collectNodeStats (planstate=0x557e4d260cd8, context=context@entry=0x557e4d26f368, ancestors=0x0) at pg_qualstats.c:938
    #1  0x00007f3d6b55fa0d in pgqs_ExecutorEnd (queryDesc=0x557e4d25ebe8) at pg_qualstats.c:695
    #2  0x0000557e4b3c96c3 in PortalCleanup (portal=0x557e4d154d58) at portalcmds.c:280
    #3  0x0000557e4b64bbf3 in PortalDrop (portal=0x557e4d154d58, isTopCommit=<optimized out>) at portalmem.c:510
    #4  0x0000557e4b43020c in SPI_cursor_close (portal=<optimized out>) at spi.c:1487
    #5  0x00007f3d61f9904e in exec_stmt_fors (stmt=0x557e4d22e748, estate=0x7ffc6d6fdf80) at pl_exec.c:2121
    #6  exec_stmt (stmt=0x557e4d22e748, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #7  exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #8  0x00007f3d61f9c6f1 in exec_for_query (estate=estate@entry=0x7ffc6d6fdf80, stmt=stmt@entry=0x557e4d22e570, portal=0x557e4d154c40, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
    #9  0x00007f3d61f9903e in exec_stmt_fors (stmt=0x557e4d22e570, estate=0x7ffc6d6fdf80) at pl_exec.c:2116
    #10 exec_stmt (stmt=0x557e4d22e570, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #11 exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #12 0x00007f3d61f9c6f1 in exec_for_query (estate=estate@entry=0x7ffc6d6fdf80, stmt=stmt@entry=0x557e4d22e1c8, portal=0x557e4d154b28, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
    #13 0x00007f3d61f9903e in exec_stmt_fors (stmt=0x557e4d22e1c8, estate=0x7ffc6d6fdf80) at pl_exec.c:2116
    #14 exec_stmt (stmt=0x557e4d22e1c8, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #15 exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #16 0x00007f3d61f9b627 in exec_stmt_block (estate=estate@entry=0x7ffc6d6fdf80, block=0x557e4d22f678) at pl_exec.c:1336
    #17 0x00007f3d61f9b845 in plpgsql_exec_function (func=func@entry=0x557e4d20e930, fcinfo=fcinfo@entry=0x7ffc6d6fe1c0, simple_eval_estate=simple_eval_estate@entry=0x557e4d223da0) at pl_exec.c:434
    #18 0x00007f3d61f901f1 in plpgsql_inline_handler (fcinfo=<optimized out>) at pl_handler.c:329
    #19 0x0000557e4b62e262 in OidFunctionCall1Coll (functionId=<optimized out>, collation=collation@entry=0, arg1=94000947505744) at fmgr.c:1592
    #20 0x0000557e4b3bef72 in ExecuteDoStmt (stmt=stmt@entry=0x557e4d202d00) at functioncmds.c:2177
    #21 0x0000557e4b52d178 in standard_ProcessUtility (parsetree=0x557e4d202d00, queryString=0x557e4d1bfde0 [...], context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at utility.c:515
    #22 0x00007f3d6b768d1d in pgss_ProcessUtility (parsetree=0x557e4d202d00, queryString=0x557e4d1bfde0 [...], context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at pg_stat_statements.c:986
    #23 0x0000557e4b529f34 in PortalRunUtility (portal=0x557e4d154a10, utilityStmt=0x557e4d202d00, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at pquery.c:1193
    #24 0x0000557e4b52aad2 in PortalRunMulti (portal=portal@entry=0x557e4d154a10, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x557e4d1c14f0, altdest=altdest@entry=0x557e4d1c14f0, completionTag=completionTag@entry=0x7ffc6d6ff060 "") at pquery.c:1349
    #25 0x0000557e4b52b81e in PortalRun (portal=portal@entry=0x557e4d154a10, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x557e4d1c14f0, altdest=altdest@entry=0x557e4d1c14f0, completionTag=completionTag@entry=0x7ffc6d6ff060 "") at pquery.c:815
    #26 0x0000557e4b528f62 in exec_simple_query (query_string=0x557e4d1bfde0 [...]) at postgres.c:1086
    #27 PostgresMain (argc=<optimized out>, argv=argv@entry=0x557e4d1636b0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4074
    #28 0x0000557e4b27e13d in BackendRun (port=0x557e4d15fb50) at postmaster.c:4294
    #29 BackendStartup (port=0x557e4d15fb50) at postmaster.c:3968
    #30 ServerLoop () at postmaster.c:1719
    #31 0x0000557e4b4be2e3 in PostmasterMain (argc=3, argv=0x557e4d135520) at postmaster.c:1327
    #32 0x0000557e4b27f6bd in main (argc=3, argv=0x557e4d135520) at main.c:228
    (gdb) list
    933     if (list_length(parent) > 1)
    934     {
    935         context->uniquequalid = hashExpr((Expr *) parent, context, true);
    936         context->qualid = hashExpr((Expr *) parent, context, false);
    937     }
    938     total_filtered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    939     context->nbfiltered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    940     context->count = planstate->instrument->tuplecount + planstate->instrument->ntuples + total_filtered;
    941     /* Add the indexquals */
    942     context->evaltype = 'i';
    (gdb) p planstate->instrument
    $1 = (Instrumentation *) 0x0
    (gdb) p debug_query_string
    $2 = 0x557e4d1bfde0 "DO $$\nDECLARE\n\tobjtype text;\n\tnames\ttext[];\n\targs\ttext[];\nBEGIN\n\tFOR objtype IN VALUES\n\t\t('table'), ('index'), ('sequence'), ('view'),\n\t\t('materialized view'), ('foreign table'),\n\t\t('table column'), ('foreign table column'),\n\t\t('aggregate'), ('function'), ('type'), ('cast'),\n\t\t('table constraint'), ('domain constraint'), ('conversion'), ('default value'),\n\t\t('operator'), ('operator class'), ('operator family'), ('rule'), ('trigger'),\n\t\t('text search parser'), ('text search dictionary'),\n\t\t('text search template'), ('text search configuration'),\n\t\t('policy'), ('user mapping'), ('default acl'), ('transform'),\n\t\t('operator of access method'), ('function of access method')\n\tLOOP\n\t\tFOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}')\n\t\tLOOP\n\t\t\tFOR args IN VALUES ('{}'), ('{integer}')\n\t\t\tLOOP\n\t\t\t\tBEGIN\n\t\t\t\t\tPERFORM pg_get_object_address(objtype, names, args);\n\t\t\t\tEXCEPTION WHEN OTHERS THEN\n\t\t\t\t\t\tRAISE WARNING 'error for %,%,%: %', objtype, names, args, sqlerrm;\n\t\t\t\tEND;\n\t\t\tEND LOOP;\n\t\tEND LOOP;\n\tEND LOOP;\nEND;\n$$;"
    (gdb) 
rjuju commented 6 years ago

Hello,

Thanks a lot for the detailed report and the analysis! I can also reproduce this issue. I'll dig into this, since I don't see an obvious reason why no instrumentation is available here.

rjuju commented 6 years ago

Ok I understand the issue. FTR, the related statement is

DO $$
DECLARE
    objtype text;
    names    text[];
    args    text[];
BEGIN
    FOR objtype IN VALUES
        ('table'), ('index'), ('sequence'), ('view'),
        ('materialized view'), ('foreign table'),
        ('table column'), ('foreign table column'),
        ('aggregate'), ('function'), ('type'), ('cast'),
        ('table constraint'), ('domain constraint'), ('conversion'), ('default value'),
        ('operator'), ('operator class'), ('operator family'), ('rule'), ('trigger'),
        ('text search parser'), ('text search dictionary'),
        ('text search template'), ('text search configuration'),
        ('policy'), ('user mapping'), ('default acl'), ('transform'),
        ('operator of access method'), ('function of access method')
    LOOP
        FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}')
        LOOP
            FOR args IN VALUES ('{}'), ('{integer}')
            LOOP
                BEGIN
                    PERFORM pg_get_object_address(objtype, names, args);
                EXCEPTION WHEN OTHERS THEN
                        RAISE WARNING 'error for %,%,%: %', objtype, names, args, sqlerrm;
                END;
            END LOOP;
        END LOOP;
    END LOOP;
END;
$$;

I'm not familiar with plpgsql internals, but IIUC with such nested FOR loop, multiple executor are started in a nested way. Therefore, each executor start will be seen as a nested_level of 0, so query_is_sampled will be set independantly. The problem is that query_is_sampled is a global variable.

So if the first executor isn't sampled but the second is, query_is_sampled will be set to true, and the first call to pgqs_ExecutorEnd will think that the query was sampled while it wasn't, try to read the instrument field and happily fail.

I don't really see an easy way to fix it. We could of course add an if (planstate->instrument) in pgqs_ExecutorEnd to ignore "false positive sampled queries", but we'd still miss the opposite case. Also, we'd need to make sure it wasn't something else, like auto_explain adding some instrumentations.

The easiet and safest solution I can imagine would be to save the query_is_sampled information in the QueryDesc, but after a quick look I don't see any custom field available nearby.

I guess we could track start/stop sequences and keep the same sampled state for any inner executor, but if any error happens, executor end won't be called, which means the backend will be forever stuck in either sampled state :(

Do you have any idea on a proper fix?

rjuju commented 6 years ago

Hi @anse1, sorry for late answer.

I couldn't find an easy solution to fix this issue, so I just added a test on the effective instrumentation to avoid the "falise positive" case. The point of sampling is to avoid overhead with highly loaded servers, so missing some queries in such cases shouldn't make any difference.

Fix pushed in ae73c10b23f6a976cba3783c2fe9a2c1d52ed6cc, thanks again for the report!