powa-team / powa-archivist

powa-archivist: the powa PostgreSQL extension
http://powa.readthedocs.io/
PostgreSQL License
51 stars 20 forks source link

Crash on pg_reload_conf() #10

Closed anse1 closed 6 years ago

anse1 commented 6 years ago

Hi,

running sqlsmith on a database with powa installed produced crashes on pg_reload_conf(). In the backtrace below the handler fired in a CriticalSection, where AllocSetContextCreate is forbidden. There are probably a lot more failure modes. For example, any elog() would be promoted to a PANIC in a Critical Section.

IMHO powa must do this like the romans do: Don't do anything in the signal handler besides setting a flag that a reload is requested and do the actual work outside the handler where it is safe to do so.

regards, Andreas

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
58      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f455d6b440a in __GI_abort () at abort.c:89
#2  0x000055b745359ec3 in ExceptionalCondition (conditionName=conditionName@entry=0x55b7453dae63 "!(CritSectionCount == 0)", errorType=errorType@entry=0x55b74539d29d "FailedAssertion", fileName=fileName@e$
#3  0x000055b74537f1a2 in MemoryContextCreate (tag=tag@entry=T_AllocSetContext, size=size@entry=200, methods=methods@entry=0x55b7458023e0 <AllocSetMethods>, parent=0x55b745a55dc0, name=name@entry=0x55b745$
#4  0x000055b74537e5a0 in AllocSetContextCreate (parent=<optimized out>, name=name@entry=0x55b74551c03c "config file processing", minContextSize=minContextSize@entry=0, initBlockSize=initBlockSize@entry=8$
#5  0x000055b745379905 in ProcessConfigFile (context=PGC_SIGHUP) at guc-file.l:147
#6  0x00007f455d057dae in powa_sighup (postgres_signal_arg=<optimized out>) at powa.c:342
#7  <signal handler called>
#8  0x00007f455d75bbc0 in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84
#9  0x000055b74504709c in XLogFileClose () at xlog.c:3557
#10 0x000055b74504740d in XLogWrite (WriteRqst=..., flexible=flexible@entry=0 '\000') at xlog.c:2253
#11 0x000055b74504809e in XLogFlush (record=15030149624) at xlog.c:2728
#12 0x000055b74503e6ca in RecordTransactionCommit () at xact.c:1286
#13 CommitTransaction () at xact.c:2039
#14 0x000055b74503f465 in CommitTransactionCommand () at xact.c:2874
#15 0x00007f455d057ff2 in powa_main (main_arg=<optimized out>) at powa.c:293
#16 0x000055b7451e560b in StartBackgroundWorker () at bgworker.c:742
#17 0x000055b7451f1135 in do_start_bgworker (rw=0x55b745a48ba0) at postmaster.c:5612
#18 maybe_start_bgworkers () at postmaster.c:5809
#19 0x000055b7451f14a3 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2834
#20 <signal handler called>
#21 0x00007f455d761273 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#22 0x000055b744fb29be in ServerLoop () at postmaster.c:1683
#23 0x000055b7451f3153 in PostmasterMain (argc=3, argv=0x55b745a245b0) at postmaster.c:1327
#24 0x000055b744fb469d in main (argc=3, argv=0x55b745a245b0) at main.c:228
anse1 commented 6 years ago

Here's another crash resulting from the signal handler doing too much: In the backtrace below, it reentered non-reentrant code while SPI was in progress.

regards Andreas

Core was generated by `postgres: bgworker: powa   snapshot                              '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
58  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f8eb41f640a in __GI_abort () at abort.c:89
#2  0x0000558c762b5ec3 in ExceptionalCondition (conditionName=conditionName@entry=0x558c76443689 "!(!InProgressBuf)", errorType=errorType@entry=0x558c762f929d "FailedAssertion", fileName=fileName@entry=0x558c764435d6 "bufmgr.c", lineNumber=lineNumber@entry=3835) at assert.c:54
#3  0x0000558c76186154 in StartBufferIO (buf=buf@entry=0x7f8eb0b03b80, forInput=forInput@entry=1 '\001') at bufmgr.c:3835
#4  0x0000558c761889a3 in BufferAlloc (foundPtr=0x7ffe8234255b "", strategy=0x0, blockNum=0, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x558c78ae2958) at bufmgr.c:1326
#5  ReadBuffer_common (smgr=0x558c78ae2958, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffe8234260f "") at bufmgr.c:745
#6  0x0000558c76188c60 in ReadBufferExtended (reln=reln@entry=0x7f8eb4cc14b0, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=0, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:664
#7  0x0000558c76188d2e in ReadBuffer (reln=reln@entry=0x7f8eb4cc14b0, blockNum=<optimized out>) at bufmgr.c:596
#8  0x0000558c75f6922e in _bt_getbuf (rel=0x7f8eb4cc14b0, blkno=<optimized out>, access=1) at nbtpage.c:576
#9  0x0000558c75f69aa7 in _bt_getroot (rel=rel@entry=0x7f8eb4cc14b0, access=access@entry=1) at nbtpage.c:158
#10 0x0000558c75f6fe6b in _bt_search (rel=rel@entry=0x7f8eb4cc14b0, keysz=keysz@entry=2, scankey=scankey@entry=0x7ffe82342880, nextkey=nextkey@entry=0 '\000', bufP=bufP@entry=0x7ffe8234277c, access=access@entry=1, snapshot=0x558c78be47d8) at nbtsearch.c:99
#11 0x0000558c75f70bf4 in _bt_first (scan=scan@entry=0x558c78fc58d0, dir=dir@entry=ForwardScanDirection) at nbtsearch.c:983
#12 0x0000558c75f6cbd2 in btgettuple (scan=0x558c78fc58d0, dir=ForwardScanDirection) at nbtree.c:326
#13 0x0000558c75f63d5b in index_getnext_tid (scan=scan@entry=0x558c78fc58d0, direction=direction@entry=ForwardScanDirection) at indexam.c:415
#14 0x0000558c75f6407b in index_getnext (scan=0x558c78fc58d0, direction=direction@entry=ForwardScanDirection) at indexam.c:553
#15 0x0000558c75f62f49 in systable_getnext (sysscan=sysscan@entry=0x558c78f70d80) at genam.c:416
#16 0x0000558c762a1bdf in SearchCatCache (cache=0x558c78aac118, v1=<optimized out>, v2=<optimized out>, v3=<optimized out>, v4=<optimized out>) at catcache.c:1183
#17 0x0000558c762afc06 in GetSysCacheOid (cacheId=cacheId@entry=54, key1=<optimized out>, key2=<optimized out>, key3=key3@entry=0, key4=key4@entry=0) at syscache.c:1061
#18 0x0000558c75fc37b6 in get_ts_config_oid (names=0x558c78fc5848, missing_ok=missing_ok@entry=1 '\001') at namespace.c:2506
#19 0x0000558c762b5d65 in check_TSCurrentConfig (newval=0x7ffe82343df0, extra=<optimized out>, source=PGC_S_FILE) at ts_cache.c:599
#20 0x0000558c762cc6cf in call_string_check_hook (newval=newval@entry=0x7ffe82343df0, extra=extra@entry=0x7ffe82343df8, source=source@entry=PGC_S_FILE, elevel=elevel@entry=12, conf=<optimized out>, conf=<optimized out>) at guc.c:9799
#21 0x0000558c762cea3c in parse_and_validate_value (record=record@entry=0x558c76756850 <ConfigureNamesString+6992>, name=name@entry=0x558c78f72bc8 "default_text_search_config", value=value@entry=0x558c78f72c00 "pg_catalog.english", source=source@entry=PGC_S_FILE, elevel=elevel@entry=12, newval=newval@entry=0x7ffe82343df0, newextra=0x7ffe82343df8) at guc.c:5730
#22 0x0000558c762cf6cd in set_config_option (name=0x558c78f72bc8 "default_text_search_config", value=0x558c78f72c00 "pg_catalog.english", context=context@entry=PGC_SIGHUP, source=source@entry=PGC_S_FILE, action=action@entry=GUC_ACTION_SET, changeVal=<optimized out>, changeVal@entry=1 '\001', elevel=12, is_reload=0 '\000') at guc.c:6332
#23 0x0000558c762d5280 in ProcessConfigFileInternal (context=context@entry=PGC_SIGHUP, applySettings=applySettings@entry=1 '\001', elevel=elevel@entry=13) at guc-file.l:439
#24 0x0000558c762d5924 in ProcessConfigFile (context=PGC_SIGHUP) at guc-file.l:155
#25 0x00007f8eb3b99dae in powa_sighup (postgres_signal_arg=<optimized out>) at powa.c:342
#26 <signal handler called>
#27 0x00007f8eb429d5a0 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:84
#28 0x0000558c7618c367 in FileRead (file=<optimized out>, buffer=buffer@entry=0x7f8eb10f6300 "\001", amount=amount@entry=8192) at fd.c:1623
#29 0x0000558c761b357f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=1, buffer=0x7f8eb10f6300 "\001") at md.c:754
#30 0x0000558c761881f4 in ReadBuffer_common (smgr=0x558c78ae2d78, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffe8234461f "") at bufmgr.c:890
#31 0x0000558c76188c60 in ReadBufferExtended (reln=reln@entry=0x7f8eb06bdc68, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=1, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:664
#32 0x0000558c76188dc4 in ReadBuffer (blockNum=<optimized out>, reln=0x7f8eb06bdc68) at bufmgr.c:596
#33 ReleaseAndReadBuffer (buffer=buffer@entry=704, relation=relation@entry=0x2c0, blockNum=<optimized out>) at bufmgr.c:1545
#34 0x0000558c75f68d08 in _bt_relandgetbuf (rel=0x2c0, rel@entry=0x7f8eb06bdc68, obuf=704, blkno=blkno@entry=1, access=-1335108504, access@entry=1) at nbtpage.c:707
#35 0x0000558c75f70016 in _bt_search (rel=rel@entry=0x7f8eb06bdc68, keysz=keysz@entry=3, scankey=scankey@entry=0x7ffe82344830, nextkey=nextkey@entry=0 '\000', bufP=bufP@entry=0x7ffe8234472c, access=access@entry=1, snapshot=0x558c78be44b0) at nbtsearch.c:166
#36 0x0000558c75f70bf4 in _bt_first (scan=scan@entry=0x558c78e034c8, dir=dir@entry=ForwardScanDirection) at nbtsearch.c:983
#37 0x0000558c75f6cbd2 in btgettuple (scan=0x558c78e034c8, dir=ForwardScanDirection) at nbtree.c:326
#38 0x0000558c75f63d5b in index_getnext_tid (scan=scan@entry=0x558c78e034c8, direction=direction@entry=ForwardScanDirection) at indexam.c:415
#39 0x0000558c760ac008 in IndexOnlyNext (node=node@entry=0x558c78e00ad8) at nodeIndexonlyscan.c:79
#40 0x0000558c7609c7d3 in ExecScanFetch (recheckMtd=0x558c760abf50 <IndexOnlyRecheck>, accessMtd=0x558c760abf90 <IndexOnlyNext>, node=0x558c78e00ad8) at execScan.c:95
#41 ExecScan (node=node@entry=0x558c78e00ad8, accessMtd=accessMtd@entry=0x558c760abf90 <IndexOnlyNext>, recheckMtd=recheckMtd@entry=0x558c760abf50 <IndexOnlyRecheck>) at execScan.c:145
#42 0x0000558c760ac4ce in ExecIndexOnlyScan (node=node@entry=0x558c78e00ad8) at nodeIndexonlyscan.c:261
#43 0x0000558c76094d88 in ExecProcNode (node=node@entry=0x558c78e00ad8) at execProcnode.c:431
#44 0x0000558c760b1d81 in ExecNestLoop (node=node@entry=0x558c78dfe0a8) at nodeNestloop.c:174
#45 0x0000558c76094ce8 in ExecProcNode (node=node@entry=0x558c78dfe0a8) at execProcnode.c:476
#46 0x0000558c760afeb9 in ExecModifyTable (node=node@entry=0x558c78dfdae8) at nodeModifyTable.c:1392
#47 0x0000558c76094df8 in ExecProcNode (node=node@entry=0x558c78dfdae8) at execProcnode.c:396
#48 0x0000558c76090e93 in ExecPostprocessPlan (estate=0x558c78df03f0) at execMain.c:1421
#49 standard_ExecutorFinish (queryDesc=0x558c78e48520) at execMain.c:407
#50 0x00007f8eb3d9ec75 in pgss_ExecutorFinish (queryDesc=0x558c78e48520) at pg_stat_statements.c:900
#51 0x00007f8eb3992e43 in pgqs_ExecutorFinish (queryDesc=0x558c78e48520) at pg_qualstats.c:592
#52 0x0000558c760c072f in _SPI_pquery (tcount=<optimized out>, fire_triggers=<optimized out>, queryDesc=<optimized out>) at spi.c:2424
#53 _SPI_execute_plan (plan=plan@entry=0x558c78c44df0, paramLI=paramLI@entry=0x558c78d30888, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#54 0x0000558c760c0c2a in SPI_execute_plan_with_paramlist (plan=0x558c78c44df0, params=0x558c78d30888, read_only=<optimized out>, tcount=1) at spi.c:446
#55 0x00007f8eb073aad2 in exec_stmt_execsql (estate=estate@entry=0x7ffe82346470, stmt=stmt@entry=0x558c78c0d418) at pl_exec.c:3515
#56 0x00007f8eb073bbc6 in exec_stmt (stmt=0x558c78c0d418, estate=0x7ffe82346470) at pl_exec.c:1503
#57 exec_stmts (estate=0x7ffe82346470, stmts=<optimized out>) at pl_exec.c:1398
#58 0x00007f8eb073e627 in exec_stmt_block (estate=estate@entry=0x7ffe82346470, block=0x558c78c0d868) at pl_exec.c:1336
#59 0x00007f8eb073e845 in plpgsql_exec_function (func=func@entry=0x558c78b24350, fcinfo=fcinfo@entry=0x558c78d2acc8, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#60 0x00007f8eb0732f95 in plpgsql_call_handler (fcinfo=0x558c78d2acc8) at pl_handler.c:255
#61 0x0000558c76095d64 in ExecMakeFunctionResultNoSets (fcache=0x558c78d2ac58, econtext=0x558c78d2aa30, isNull=0x558c78d2b608 "\177~\177\177\177\177\177\177\060\r\315x\214U", isDone=<optimized out>) at execQual.c:2041
#62 0x0000558c7609c24e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe823467cc, itemIsDone=0x558c78d2b748, isnull=0x558c78d2b608 "\177~\177\177\177\177\177\177\060\r\315x\214U", values=0x558c78d2b5e8, econtext=0x558c78d2aa30, targetlist=0x558c78d2b710) at execQual.c:5423
#63 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe823467cc) at execQual.c:5647
#64 0x0000558c760b3169 in ExecResult (node=node@entry=0x558c78d2a918) at nodeResult.c:155
#65 0x0000558c76094e08 in ExecProcNode (node=node@entry=0x558c78d2a918) at execProcnode.c:392
#66 0x0000558c76090c8e in ExecutePlan (dest=0x558c76753a60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558c78d2a918, estate=0x558c78d2a800) at execMain.c:1569
#67 standard_ExecutorRun (queryDesc=0x558c78d28c10, direction=<optimized out>, count=0) at execMain.c:338
#68 0x00007f8eb3d9ed5d in pgss_ExecutorRun (queryDesc=0x558c78d28c10, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#69 0x00007f8eb3992f25 in pgqs_ExecutorRun (queryDesc=0x558c78d28c10, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#70 0x0000558c760c087f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x558c78d28c10) at spi.c:2412
#71 _SPI_execute_plan (plan=plan@entry=0x7ffe82346bd0, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#72 0x0000558c760c0ac3 in SPI_execute (src=src@entry=0x558c78d15508 "SELECT powa_statements_snapshot()", read_only=<optimized out>, tcount=tcount@entry=0) at spi.c:383
#73 0x00007f8eb073d186 in exec_stmt_dynexecute (stmt=0x558c78b5a0e0, estate=0x7ffe823472d0) at pl_exec.c:3707
#74 exec_stmt (stmt=0x558c78b5a0e0, estate=0x7ffe823472d0) at pl_exec.c:1507
#75 exec_stmts (estate=0x7ffe823472d0, stmts=<optimized out>) at pl_exec.c:1398
#76 0x00007f8eb073e4d3 in exec_stmt_block (estate=estate@entry=0x7ffe823472d0, block=block@entry=0x558c78b5b238) at pl_exec.c:1197
#77 0x00007f8eb073bdff in exec_stmt (stmt=0x558c78b5b238, estate=0x7ffe823472d0) at pl_exec.c:1431
#78 exec_stmts (estate=0x7ffe823472d0, stmts=<optimized out>) at pl_exec.c:1398
#79 0x00007f8eb073f6f1 in exec_for_query (estate=estate@entry=0x7ffe823472d0, stmt=stmt@entry=0x558c78b598f8, portal=0x558c78bb6290, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
#80 0x00007f8eb073c03e in exec_stmt_fors (stmt=0x558c78b598f8, estate=0x7ffe823472d0) at pl_exec.c:2116
#81 exec_stmt (stmt=0x558c78b598f8, estate=0x7ffe823472d0) at pl_exec.c:1467
#82 exec_stmts (estate=0x7ffe823472d0, stmts=<optimized out>) at pl_exec.c:1398
#83 0x00007f8eb073e627 in exec_stmt_block (estate=estate@entry=0x7ffe823472d0, block=0x558c78b60970) at pl_exec.c:1336
#84 0x00007f8eb073e845 in plpgsql_exec_function (func=func@entry=0x558c78b1cc80, fcinfo=fcinfo@entry=0x558c78b75098, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#85 0x00007f8eb0732f95 in plpgsql_call_handler (fcinfo=0x558c78b75098) at pl_handler.c:255
#86 0x0000558c76095d64 in ExecMakeFunctionResultNoSets (fcache=0x558c78b75028, econtext=0x558c78b74e00, isNull=0x558c78b759d8 "\200~rx\214U", isDone=<optimized out>) at execQual.c:2041
#87 0x0000558c7609c24e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe8234762c, itemIsDone=0x558c78b75b18, isnull=0x558c78b759d8 "\200~rx\214U", values=0x558c78b759b8, econtext=0x558c78b74e00, targetlist=0x558c78b75ae0) at execQual.c:5423
#88 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe8234762c) at execQual.c:5647
#89 0x0000558c760b3169 in ExecResult (node=node@entry=0x558c78b74ce8) at nodeResult.c:155
#90 0x0000558c76094e08 in ExecProcNode (node=node@entry=0x558c78b74ce8) at execProcnode.c:392
#91 0x0000558c76090c8e in ExecutePlan (dest=0x558c76753a60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558c78b74ce8, estate=0x558c78b74bd0) at execMain.c:1569
#92 standard_ExecutorRun (queryDesc=0x558c78d0cb50, direction=<optimized out>, count=0) at execMain.c:338
#93 0x00007f8eb3d9ed5d in pgss_ExecutorRun (queryDesc=0x558c78d0cb50, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#94 0x00007f8eb3992f25 in pgqs_ExecutorRun (queryDesc=0x558c78d0cb50, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#95 0x0000558c760c087f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x558c78d0cb50) at spi.c:2412
#96 _SPI_execute_plan (plan=plan@entry=0x7ffe82347a30, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#97 0x0000558c760c0ac3 in SPI_execute (src=src@entry=0x7f8eb3b9a398 "SELECT powa_take_snapshot()", read_only=read_only@entry=0 '\000', tcount=tcount@entry=0) at spi.c:383
#98 0x00007f8eb3b99fc2 in powa_main (main_arg=<optimized out>) at powa.c:288
#99 0x0000558c7614160b in StartBackgroundWorker () at bgworker.c:742
#100 0x0000558c7614d135 in do_start_bgworker (rw=0x558c786dbba0) at postmaster.c:5612
#101 maybe_start_bgworkers () at postmaster.c:5809
#102 0x0000558c7614d4a3 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2834
#103 <signal handler called>
#104 0x00007f8eb42a3273 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#105 0x0000558c75f0e9be in ServerLoop () at postmaster.c:1683
#106 0x0000558c7614f153 in PostmasterMain (argc=3, argv=0x558c786b75b0) at postmaster.c:1327
#107 0x0000558c75f1069d in main (argc=3, argv=0x558c786b75b0) at main.c:228
rjuju commented 6 years ago

Thanks again for the report! I definitely agree about the signal handler, I'll refactor the code to use a flag. BTW, I think only elog(ERROR) and above are promoted to PANIC in a critsection (not a good reason to use elog inside such sections though). And thanks also for sqlsmith, great tool as usual :)

rjuju commented 6 years ago

@anse1 do you still have the query(ies) that raised this issue, or should I try to hit it running sqlsmith?

anse1 commented 6 years ago

Julien Rouhaud writes:

@anse1 do you still have the query(ies) that raised this issue, or should I try to hit it running sqlsmith?

I don't think that the queries themselves are very relevant to the signal handler problems and sqlsmith might also not be the fastest way to reproduce it, since it can take quite a while until sqlsmith by itself calls pg_reload_conf() just in the right moment the powa worker is doing something.

I guess a better receipe for this issue would be to configure an aggressive sampling interval for the powa worker on a big, heaviy loaded database to make sure the powa worker is generally very busy and then issue a statement like the following to give the handler a high chance of clobbering something:

| select pg_reload_conf(), pg_sleep(0.05) from generate_series(1,1000);

Alternatively, stopping the powa worker at interesting positions with gdb and manually issuing a SIGHUP should also be an efficient way to trigger these problems.

regards, Andreas

anse1 commented 6 years ago

Digging some more through the collected core dumps, I also found the one below with a backtrace that doesn't sport the signal handler. Looking at the code, the failed assertion did hold a couple of lines earlier in _bt_page_recyclable(page). Since I don't see code in between that could change the outcome of the assertion, I think that the signal handler was the culprit here as well.

Core was generated by `postgres: bgworker: powa   snapshot                              '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f28c03bf40a in __GI_abort () at abort.c:89
#2  0x000055b6718afec3 in ExceptionalCondition (conditionName=conditionName@entry=0x55b6718f3950 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=errorType@entry=0x55b6718f329d "FailedAssertion", fileName=fileName@entry=0x55b6718f38c8 "../../../../src/include/storage/bufpage.h", lineNumber=lineNumber@entry=314) at assert.c:54
#3  0x000055b67156316e in PageValidateSpecialPointer (page=0x7f28bdf35300 "") at ../../../../src/include/storage/bufpage.h:314
#4  _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, access=access@entry=2, blkno=18) at nbtpage.c:629
#5  0x000055b67156321c in _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, blkno=blkno@entry=4294967295, access=access@entry=2) at nbtpage.c:573
#6  0x000055b6715605f5 in _bt_split (newitemonleft=1 '\001', newitem=0x55b674471670, newitemsz=16, newitemoff=2, firstright=192, cbuf=0, buf=1628, rel=0x7f28bc88f1e0) at nbtinsert.c:986
#7  _bt_insertonpg (rel=rel@entry=0x7f28bc88f1e0, buf=buf@entry=1628, cbuf=cbuf@entry=0, stack=stack@entry=0x55b674471600, itup=itup@entry=0x55b674471670, newitemoff=2, split_only_page=0 '\000') at nbtinsert.c:781
#8  0x000055b6715628d4 in _bt_doinsert (rel=rel@entry=0x7f28bc88f1e0, itup=itup@entry=0x55b674471670, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, heapRel=heapRel@entry=0x7f28bc887be0) at nbtinsert.c:204
#9  0x000055b671566ad1 in btinsert (rel=0x7f28bc88f1e0, values=<optimized out>, isnull=<optimized out>, ht_ctid=0x55b673635134, heapRel=0x7f28bc887be0, checkUnique=UNIQUE_CHECK_NO) at nbtree.c:284
#10 0x000055b671689e92 in ExecInsertIndexTuples (slot=slot@entry=0x55b6736fddf0, tupleid=tupleid@entry=0x55b673635134, estate=estate@entry=0x55b6736ea9d0, noDupErr=noDupErr@entry=0 '\000', specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at execIndexing.c:388
#11 0x000055b6716aa9c9 in ExecInsert (canSetTag=0 '\000', estate=0x55b6736ea9d0, onconflict=<optimized out>, arbiterIndexes=0x0, planSlot=0x55b6736fddf0, slot=0x55b6736fddf0, mtstate=0x55b6736f80c8) at nodeModifyTable.c:496
#12 ExecModifyTable (node=node@entry=0x55b6736f80c8) at nodeModifyTable.c:1511
#13 0x000055b67168edf8 in ExecProcNode (node=node@entry=0x55b6736f80c8) at execProcnode.c:396
#14 0x000055b67168ae93 in ExecPostprocessPlan (estate=0x55b6736ea9d0) at execMain.c:1421
#15 standard_ExecutorFinish (queryDesc=0x55b6735b5198) at execMain.c:407
#16 0x00007f28bff67c75 in pgss_ExecutorFinish (queryDesc=0x55b6735b5198) at pg_stat_statements.c:900
#17 0x00007f28bfb5be43 in pgqs_ExecutorFinish (queryDesc=0x55b6735b5198) at pg_qualstats.c:592
#18 0x000055b6716ba72f in _SPI_pquery (tcount=<optimized out>, fire_triggers=<optimized out>, queryDesc=<optimized out>) at spi.c:2424
#19 _SPI_execute_plan (plan=plan@entry=0x55b673354c90, paramLI=paramLI@entry=0x55b673983cf8, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#20 0x000055b6716bac2a in SPI_execute_plan_with_paramlist (plan=0x55b673354c90, params=0x55b673983cf8, read_only=<optimized out>, tcount=1) at spi.c:446
#21 0x00007f28bc903ad2 in exec_stmt_execsql (estate=estate@entry=0x7ffe72a8b4b0, stmt=stmt@entry=0x55b67331d2b8) at pl_exec.c:3515
#22 0x00007f28bc904bc6 in exec_stmt (stmt=0x55b67331d2b8, estate=0x7ffe72a8b4b0) at pl_exec.c:1503
#23 exec_stmts (estate=0x7ffe72a8b4b0, stmts=<optimized out>) at pl_exec.c:1398
#24 0x00007f28bc907627 in exec_stmt_block (estate=estate@entry=0x7ffe72a8b4b0, block=0x55b67331d708) at pl_exec.c:1336
#25 0x00007f28bc907845 in plpgsql_exec_function (func=func@entry=0x55b673246ff0, fcinfo=fcinfo@entry=0x55b67397e138, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#26 0x00007f28bc8fbf95 in plpgsql_call_handler (fcinfo=0x55b67397e138) at pl_handler.c:255
#27 0x000055b67168fd64 in ExecMakeFunctionResultNoSets (fcache=0x55b67397e0c8, econtext=0x55b67397dea0, isNull=0x55b67397ea78 "\177~\177\177\177\177\177\177p\244Ps\266U", isDone=<optimized out>) at execQual.c:2041
#28 0x000055b67169624e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe72a8b80c, itemIsDone=0x55b67397ebb8, isnull=0x55b67397ea78 "\177~\177\177\177\177\177\177p\244Ps\266U", values=0x55b67397ea58, econtext=0x55b67397dea0, targetlist=0x55b67397eb80) at execQual.c:5423
#29 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe72a8b80c) at execQual.c:5647
#30 0x000055b6716ad169 in ExecResult (node=node@entry=0x55b67397dd88) at nodeResult.c:155
#31 0x000055b67168ee08 in ExecProcNode (node=node@entry=0x55b67397dd88) at execProcnode.c:392
#32 0x000055b67168ac8e in ExecutePlan (dest=0x55b671d4da60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<opgtimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b67397dd88, estate=0x55b67397dc70) at execMain.c:1569
#33 standard_ExecutorRun (queryDesc=0x55b67397c080, direction=<optimized out>, count=0) at execMain.c:338
#34 0x00007f28bff67d5d in pgss_ExecutorRun (queryDesc=0x55b67397c080, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#35 0x00007f28bfb5bf25 in pgqs_ExecutorRun (queryDesc=0x55b67397c080, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#36 0x000055b6716ba87f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x55b67397c080) at spi.c:2412
#37 _SPI_execute_plan (plan=plan@entry=0x7ffe72a8bc10, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#38 0x000055b6716baac3 in SPI_execute (src=src@entry=0x55b673968978 "SELECT powa_statements_snapshot()", read_only=<optimized out>, tcount=tcount@entry=0) at spi.c:383
#39 0x00007f28bc906186 in exec_stmt_dynexecute (stmt=0x55b673269f80, estate=0x7ffe72a8c310) at pl_exec.c:3707
#40 exec_stmt (stmt=0x55b673269f80, estate=0x7ffe72a8c310) at pl_exec.c:1507
#41 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#42 0x00007f28bc9074d3 in exec_stmt_block (estate=estate@entry=0x7ffe72a8c310, block=block@entry=0x55b67326b0d8) at pl_exec.c:1197
#43 0x00007f28bc904dff in exec_stmt (stmt=0x55b67326b0d8, estate=0x7ffe72a8c310) at pl_exec.c:1431
#44 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#45 0x00007f28bc9086f1 in exec_for_query (estate=estate@entry=0x7ffe72a8c310, stmt=stmt@entry=0x55b673269798, portal=0x55b6732c6130, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
#46 0x00007f28bc90503e in exec_stmt_fors (stmt=0x55b673269798, estate=0x7ffe72a8c310) at pl_exec.c:2116
#47 exec_stmt (stmt=0x55b673269798, estate=0x7ffe72a8c310) at pl_exec.c:1467
#48 exec_stmts (estate=0x7ffe72a8c310, stmts=<optimized out>) at pl_exec.c:1398
#49 0x00007f28bc907627 in exec_stmt_block (estate=estate@entry=0x7ffe72a8c310, block=0x55b673270810) at pl_exec.c:1336
#50 0x00007f28bc907845 in plpgsql_exec_function (func=func@entry=0x55b67323f920, fcinfo=fcinfo@entry=0x55b673727058, simple_eval_estate=simple_eval_estate@entry=0x0) at pl_exec.c:434
#51 0x00007f28bc8fbf95 in plpgsql_call_handler (fcinfo=0x55b673727058) at pl_handler.c:255
#52 0x000055b67168fd64 in ExecMakeFunctionResultNoSets (fcache=0x55b673726fe8, econtext=0x55b673726dc0, isNull=0x55b673727998 "\b~", isDone=<optimized out>) at execQual.c:2041
#53 0x000055b67169624e in ExecTargetList (tupdesc=<optimized out>, isDone=0x7ffe72a8c66c, itemIsDone=0x55b673727ad8, isnull=0x55b673727998 "\b~", values=0x55b673727978, econtext=0x55b673726dc0, targetlist=0x55b673727aa0) at execQual.c:5423
#54 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffe72a8c66c) at execQual.c:5647
#55 0x000055b6716ad169 in ExecResult (node=node@entry=0x55b673726ca8) at nodeResult.c:155
#56 0x000055b67168ee08 in ExecProcNode (node=node@entry=0x55b673726ca8) at execProcnode.c:392
#57 0x000055b67168ac8e in ExecutePlan (dest=0x55b671d4da60 <spi_printtupDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b673726ca8, estate=0x55b673726b90) at execMain.c:1569
#58 standard_ExecutorRun (queryDesc=0x55b673722fa0, direction=<optimized out>, count=0) at execMain.c:338
#59 0x00007f28bff67d5d in pgss_ExecutorRun (queryDesc=0x55b673722fa0, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#60 0x00007f28bfb5bf25 in pgqs_ExecutorRun (queryDesc=0x55b673722fa0, direction=ForwardScanDirection, count=0) at pg_qualstats.c:564
#61 0x000055b6716ba87f in _SPI_pquery (tcount=0, fire_triggers=1 '\001', queryDesc=0x55b673722fa0) at spi.c:2412
#62 _SPI_execute_plan (plan=plan@entry=0x7ffe72a8ca70, paramLI=paramLI@entry=0x0, snapshot=snapshot@entry=0x0, crosscheck_snapshot=crosscheck_snapshot@entry=0x0, read_only=read_only@entry=0 '\000', fire_triggers=fire_triggers@entry=1 '\001', tcount=<optimized out>) at spi.c:2188
#63 0x000055b6716baac3 in SPI_execute (src=src@entry=0x7f28bfd63398 "SELECT powa_take_snapshot()", read_only=read_only@entry=0 '\000', tcount=tcount@entry=0) at spi.c:383
#64 0x00007f28bfd62fc2 in powa_main (main_arg=<optimized out>) at powa.c:288
#65 0x000055b67173b60b in StartBackgroundWorker () at bgworker.c:742
#66 0x000055b671747135 in do_start_bgworker (rw=0x55b672ed1ba0) at postmaster.c:5612
#67 maybe_start_bgworkers () at postmaster.c:5809
#68 0x000055b6717474a3 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2834
#69 <signal handler called>
#70 0x00007f28c046c273 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#71 0x000055b6715089be in ServerLoop () at postmaster.c:1683
#72 0x000055b671749153 in PostmasterMain (argc=3, argv=0x55b672ead5b0) at postmaster.c:1327
#73 0x000055b67150a69d in main (argc=3, argv=0x55b672ead5b0) at main.c:228
(gdb) p debug_query_string
$1 = 0x0
(gdb) frame 3
#3  0x000055b67156316e in PageValidateSpecialPointer (page=0x7f28bdf35300 "") at ../../../../src/include/storage/bufpage.h:314
(gdb) up
#4  _bt_getbuf (rel=rel@entry=0x7f28bc88f1e0, access=access@entry=2, blkno=18) at nbtpage.c:629
(gdb) list
624                      * WAL record that will allow us to conflict with queries
625                      * running on standby.
626                      */
627                     if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
628                     {
629                         BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
630 
631                         _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
632                     }
633 
(gdb) list -
614             if (blkno == InvalidBlockNumber)
615                 break;
616             buf = ReadBuffer(rel, blkno);
617             if (ConditionalLockBuffer(buf))
618             {
619                 page = BufferGetPage(buf);
620                 if (_bt_page_recyclable(page))
621                 {
622                     /*
623                      * If we are generating WAL for Hot Standby then create a
anse1 commented 6 years ago

Just to clarify: I didn't notice your fix on the branch, so my last report was still about a core dump with the old handler.

I'll do some testing with the new handler this evening.

rjuju commented 6 years ago

Thanks! Actually I forgot to push it yesterday :/ FTR, @marco44 has also been doing some testing on it, and it looks ok. He couldn't reproduce the same problem as what's reported here though. I hope I'll also be able to do some more testing too soon.

anse1 commented 6 years ago

Looking good: No crashes in powa so far with the new handler of 3e71401602dfa while I had about 50 yesterday with the same amount of stress testing done.

regards, andreas

rjuju commented 6 years ago

Thanks a lot for the confirmation!

I'll do some more testing to make sure the frequency is updated as needed in every situation and push the fix.

rjuju commented 6 years ago

Sorry for late replied. I double checked the patch and everything looks goof to me, so I just pushed 3e71401602dfa945ec4b513df507b84389fc2393. Thanks again for reporting this issue!

rjuju commented 6 years ago

I just released new version including this bugfix.

Thanks!