powa-team / pg_qualstats

A PostgreSQL extension for collecting statistics about predicates, helping find what indices are missing

Segmentation fault #30

Closed seqizz closed 4 years ago

seqizz commented 4 years ago

Hi there :wave:

After an upgrade of qualstats (from 1.0.9 to 2.0.1), I've started to get segfaults which cause postgres to restart while powa is taking snapshots.

Some environment info:

Error from postgres logs:

2020-05-06 14:44:34 UTC [4833-62] LOG:  background worker "powa" (PID 19562) was terminated by signal 11: Segmentation fault
2020-05-06 14:44:34 UTC [4833-63] DETAIL:  Failed process was running: SELECT powa_take_snapshot()
2020-05-06 14:44:34 UTC [4833-64] LOG:  terminating any other active server processes
2020-05-06 14:44:34 UTC [32741-1] user1@10.17.1.52 WARNING:  terminating connection because of crash of another server process
2020-05-06 14:44:34 UTC [32741-2] user1@10.17.1.52 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-05-06 14:44:34 UTC [32741-3] user1@10.17.1.52 HINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
...
2020-05-06 14:44:34 UTC [4833-65] LOG:  all server processes terminated; reinitializing
2020-05-06 14:44:34 UTC [16293-1] LOG:  database system was interrupted; last known up at 2020-05-06 14:20:03 UTC
2020-05-06 14:44:34 UTC [16328-1] user2@[local] FATAL:  the database system is in recovery mode
2020-05-06 14:44:35 UTC [16337-1] user2@[local] FATAL:  the database system is in recovery mode
2020-05-06 14:44:35 UTC [16293-2] LOG:  database system was not properly shut down; automatic recovery in progress
2020-05-06 14:44:35 UTC [16293-3] LOG:  redo starts at 51/327D5540
2020-05-06 14:44:35 UTC [16293-4] LOG:  redo done at 51/32935E98
2020-05-06 14:44:35 UTC [16293-5] LOG:  last completed transaction was at log time 2020-05-06 14:39:34.381339+00
2020-05-06 14:44:35 UTC [16293-6] LOG:  checkpoint starting: end-of-recovery immediate
2020-05-06 14:44:35 UTC [16293-7] LOG:  checkpoint complete: wrote 169 buffers (0.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.002 s, total=0.011 s; sync files=15, longest=0.001 s, average=0.000 s; distance=1410 kB, estimate=1410 kB
2020-05-06 14:44:35 UTC [4833-66] LOG:  database system is ready to accept connections

Meanwhile at kernel log:

[18579089.750187] traps: postgres[25395] general protection ip:5619cf6a452d sp:7ffea07b5070 error:0 in postgres[5619cf699000+465000]
[18584190.492834] postgres[19562]: segfault at 0 ip 00005619cf6a452d sp 00007ffea07b5070 error 4 in postgres[5619cf699000+465000]
[18584190.492843] Code: 43 01 49 83 c4 70 4c 39 fb 74 72 48 89 c3 41 80 3c 1e 00 75 e9 49 0f bf 04 24 48 8b 3c de 66 83 f8 ff 75 a9 41 80 7c 24 11 70 <0f> b6 17 74 1e f6 c2 03 75 19 8b 0f c1 e9 02 83 e9 04 4c 8d 41 01

(After the upgrade, I manually executed drop extension + create extension for pg_qualstats in the powa database to upgrade it to v2.0.1.)
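For reference, this is roughly what I ran in the powa database (from memory, so take it as a sketch):

-- connected to the powa database (psql powa)
DROP EXTENSION pg_qualstats;
CREATE EXTENSION pg_qualstats;  -- installs the 2.0.1 SQL objects shipped by the new package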

I couldn't find anything specific that I need to apply for this upgrade; any suggestions?

Thanks in advance :+1:

rjuju commented 4 years ago

Hi,

Ouch :( Did you restart postgres after upgrading the pg_qualstats package?

Does the problem still happen? Is there any chance you have a core dump somewhere and the possibility to install the debug packages?

rjuju commented 4 years ago

For the record, I just tried with this environment, set up from scratch:

powa=# \dx pg_qualstats 
                           List of installed extensions
     Name     | Version | Schema |                  Description                   
--------------+---------+--------+------------------------------------------------
 pg_qualstats | 2.0.1   | public | An extension collecting statistics about quals
(1 row)

# cat /etc/debian_version 
10.3

and I don't hit any issue.

seqizz commented 4 years ago

Ouch :( Did you restart postgres after upgrading the pg_qualstats package?

Not actually; since it hasn't hit an important server yet and postgres keeps itself up afterwards, I'm keeping things as they are.

Does the problem still happen? Is there any chance you have a core dump somewhere and the possibility to install the debug packages?

Not sure how to get the dump; I'll check the available packages, though it will possibly need a db restart. I'll try to install the lower version and upgrade again if it doesn't fail easily. šŸ‘

seqizz commented 4 years ago

Heh, I was lucky. As soon as I installed systemd-coredump, I got a new one. Attaching.

coredump-postgres.zip

rjuju commented 4 years ago

Thanks, that's quite helpful.

So the backtrace is:

#0  0x00005619cf6a452d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x5619d1863060, 
    values=values@entry=0x5619d1864290, isnull=isnull@entry=0x5619d18643a0)
    at ./build/../src/backend/access/common/heaptuple.c:175
#1  0x00005619cf6a63fd in heap_form_minimal_tuple (tupleDescriptor=0x5619d1863060, 
    values=0x5619d1864290, isnull=0x5619d18643a0)
    at ./build/../src/backend/access/common/heaptuple.c:1824
#2  0x00005619cfaeb9b0 in tuplestore_putvalues (state=0x5619d1864070, tdesc=<optimized out>, 
    values=<optimized out>, isnull=<optimized out>)
    at ./build/../src/backend/utils/sort/tuplestore.c:756
#3  0x00007f515181f171 in pgqs_process_opexpr (expr=0x5619d18643a0, context=0x5619d1864290)
    at pg_qualstats.c:1644
[...]

It looks like the problem is that the loaded library version and the extension version didn't match, leading to this bad behavior. I guess we should really handle ABI changes more cleanly than we do, as not restarting postgres when updating the extension version can lead to a crash.
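To illustrate, the SQL-level side of that mismatch can be checked from any session, while the loaded library itself only changes when postgres is restarted (a quick sketch):

-- SQL-level definition, changed by CREATE/DROP/ALTER EXTENSION:
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_qualstats';
-- libraries preloaded at server start; the .so itself is only re-read on a restart:
SHOW shared_preload_libraries;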

I'll work on a new version soon that takes care of that. In the meantime, I think that now that your server did restart, the crash shouldn't happen anymore. If that's not a production server (or if you have a similar and non-critical environment), could you confirm?

akovac commented 4 years ago

Hi rjuju,

I'm also affected by the same issue after the upgrade.

ii postgresql-12-pg-qualstats 2.0.1-1.pgdg100+1 amd64 PostgreSQL extension to gather statistics about predicates.

The thing is that restarting postgres did not resolve the issue. I had to downgrade to

ii postgresql-12-pg-qualstats 1.0.9-1.pgdg100+1 amd64 PostgreSQL extension to gather statistics about predicates.

If more information is needed I'm willing to reproduce.

EDIT: relevant log after restart:

2020-05-06 21:26:30.989 UTC [166] LOG:  database system is shut down
2020-05-06 23:26:44.047 CEST [27958] LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-05-06 23:26:44.047 CEST [27958] LOG:  pg_stat_kcache.linux_hz is set to 1000000
2020-05-06 23:26:44.048 CEST [27958] LOG:  starting PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-05-06 23:26:44.048 CEST [27958] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-05-06 23:26:44.048 CEST [27958] LOG:  listening on IPv6 address "::", port 5432
2020-05-06 23:26:44.048 CEST [27958] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-06 23:26:44.231 CEST [27976] LOG:  database system was shut down at 2020-05-06 23:26:30 CEST
2020-05-06 23:26:44.235 CEST [27977] zabbix@mirex FATAL:  the database system is starting up
2020-05-06 23:26:44.236 CEST [27958] LOG:  database system is ready to accept connections
2020-05-06 23:26:44.240 CEST [27984] LOG:  POWA connected to database powa
2020-05-06 23:26:45.804 CEST [28012] repmgr@[unknown] LOG:  received replication command: IDENTIFY_SYSTEM
2020-05-06 23:26:45.804 CEST [28012] repmgr@[unknown] LOG:  received replication command: START_REPLICATION SLOT "repmgr_slot_3" 178/6B000000 TIMELINE 6
2020-05-06 23:31:44.615 CEST [27958] LOG:  background worker " ^N)9<98>U" (PID 27984) was terminated by signal 11: Segmentation fault
2020-05-06 23:31:44.615 CEST [27958] DETAIL:  Failed process was running: SELECT powa_take_snapshot()
2020-05-06 23:31:44.615 CEST [27958] LOG:  terminating any other active server processes

Br

seqizz commented 4 years ago

I'll work on a new version soon that takes care of that. In the meantime, I think that now that your server did restart, the crash shouldn't happen anymore. If that's not a production server (or if you have a similar and non-critical environment), could you confirm?

Thanks for the interest and the quick reply. Sadly I've already reverted to the old version of the plugin due to constant crashes after supplying the dump :\ So I can't say anything about the restart case.

But as a rookie question: wouldn't LOG: all server processes terminated; reinitializing imply that postgres always restarts in case of a segfault? Would anything change with a systemd/init-level restart?

rjuju commented 4 years ago

Actually after looking at the coredump again this morning, it doesn't really make sense:

#0  0x00005619cf6a452d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x5619d1863060, values=values@entry=0x5619d1864290, isnull=isnull@entry=0x5619d18643a0)
    at ./build/../src/backend/access/common/heaptuple.c:175
#1  0x00005619cf6a63fd in heap_form_minimal_tuple (tupleDescriptor=0x5619d1863060, values=0x5619d1864290, isnull=0x5619d18643a0) at ./build/../src/backend/access/common/heaptuple.c:1824
#2  0x00005619cfaeb9b0 in tuplestore_putvalues (state=0x5619d1864070, tdesc=<optimized out>, values=<optimized out>, isnull=<optimized out>) at ./build/../src/backend/utils/sort/tuplestore.c:756
#3  0x00007f515181f171 in pgqs_process_opexpr (expr=0x5619d18643a0, context=0x5619d1864290) at pg_qualstats.c:1644
#4  0x00005619cf845e0c in ExecMakeTableFunctionResult (setexpr=0x5619d15b4018, econtext=0x5619d15b3f38, argContext=<optimized out>, expectedDesc=<optimized out>, randomAccess=<optimized out>)
    at ./build/../src/backend/executor/execSRF.c:231
#5  0x00005619cf8528c1 in FunctionNext (node=node@entry=0x5619d15b3e28) at ./build/../src/backend/executor/nodeFunctionscan.c:94
#6  0x00005619cf8451c9 in ExecScanFetch (recheckMtd=0x5619cf8525c0 <FunctionRecheck>, accessMtd=0x5619cf8525f0 <FunctionNext>, node=0x5619d15b3e28) at ./build/../src/backend/executor/execScan.c:95

pgqs_process_opexpr should be called in ExecutorEnd(), not in the middle of execution. Also, pgqs_process_opexpr doesn't call tuplestore_putvalues(). I'm not sure what's wrong with this coredump but it's unfortunately not enough to understand where the problem is.

I installed the same environment with original debian packages, and I'm not able to reproduce the issue.

@akovac Thanks also for the report!

@akovac @seqizz Is there any chance that you can reproduce the issue and get the stacktrace while having the -dbgsym packages installed? (You can check https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD if you need some pointers.) Also, does the crash always happen when calling manually:

seqizz commented 4 years ago

Ehmm.. after downgrading the package and doing drop/create extension again, I still get crashes with the previous version. The DB was also restarted; not sure what is going on.

I'm not sure if this is sufficient, but here is a stacktrace from the local dump; maybe having dbgsym installed locally will help: bt.txt

I'm also running postgres under GDB currently, but I'm unable to trigger the crash with manual commands. I'll try to upgrade the plugin on this one and report back with a trace if it crashes.

akovac commented 4 years ago

@seqizz I'm fine after the downgrade - no crashes. The difference is that I didn't do drop/create extension after the first upgrade or the later downgrade :-/ @rjuju I'll be able to try and reproduce a bit later, after 17:00, as the system is in production so I'll need to wait for some processes to finish first.

seqizz commented 4 years ago

Aaah I've caught it at last. Here it is: bt2.txt

A few lines are missing from the beginning of the file, so here they are in case it's helpful:

(gdb) cont
Continuing.

Thread 2.1 "postgres" hit Catchpoint 1 (signal SIGSEGV), 0x000055555561352d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x55555620d2f0,
    values=values@entry=0x55555620e520, isnull=isnull@entry=0x55555620e630) at ./build/../src/backend/access/common/heaptuple.c:175
175     ./build/../src/backend/access/common/heaptuple.c: No such file or directory.
(gdb) bt
... (rest is in the file)
akovac commented 4 years ago

I can reliably reproduce with both select powa_take_snapshot() and select count(*) from pg_qualstats

After downgrading and restarting the postgresql server everything is fine again.

@rjuju I didn't do a drop/create of the extension (CREATE EXTENSION pg_qualstats). Is that required for a proper upgrade?

If I run that same select from pg_qualstats query on the read-only standby server, it doesn't trigger a segfault. Both servers are running inside unprivileged LXC containers (Debian 10 / PostgreSQL 12) without any pending updates (except the downgraded qualstats).

Backtrace from select count(*) from pg_qualstats:

#0  0x000055b165e4119d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x55b1673fb0e8, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/access/common/heaptuple.c:138
#1  0x000055b165e42b3c in heap_form_minimal_tuple (tupleDescriptor=0x55b1673fb0e8, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/access/common/heaptuple.c:1396
#2  0x000055b1662b0a10 in tuplestore_putvalues (state=state@entry=0x55b1673fb8f8, tdesc=<optimized out>, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/utils/sort/tuplestore.c:756
#3  0x00007fd40396f20e in pg_qualstats_common (fcinfo=<optimized out>, include_names=<optimized out>) at pg_qualstats.c:2083
#4  0x000055b165fef6b5 in ExecMakeTableFunctionResult (setexpr=0x55b1673dfcd8, econtext=0x55b1673dfbd8, argContext=<optimized out>, expectedDesc=0x55b1673ed088, randomAccess=false) at ./build/../src/backend/executor/execSRF.c:233
#5  0x000055b165ffdf3d in FunctionNext (node=node@entry=0x55b1673df560) at ./build/../src/backend/executor/nodeFunctionscan.c:94
#6  0x000055b165feec09 in ExecScanFetch (recheckMtd=0x55b165ffdc10 <FunctionRecheck>, accessMtd=0x55b165ffdc40 <FunctionNext>, node=0x55b1673df560) at ./build/../src/backend/executor/execScan.c:133
#7  ExecScan (node=0x55b1673df560, accessMtd=0x55b165ffdc40 <FunctionNext>, recheckMtd=0x55b165ffdc10 <FunctionRecheck>) at ./build/../src/backend/executor/execScan.c:183
#8  0x000055b16600368f in ExecProcNode (node=0x55b1673df560) at ./build/../src/include/executor/executor.h:239
#9  ExecHashJoinImpl (parallel=false, pstate=0x55b1673df918) at ./build/../src/backend/executor/nodeHashjoin.c:262
#10 ExecHashJoin (pstate=0x55b1673df918) at ./build/../src/backend/executor/nodeHashjoin.c:591
#11 0x000055b165ff765c in ExecProcNode (node=0x55b1673df918) at ./build/../src/include/executor/executor.h:239
#12 fetch_input_tuple (aggstate=aggstate@entry=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:406
#13 0x000055b165ff91e0 in agg_retrieve_direct (aggstate=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:1747
#14 ExecAgg (pstate=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:1562
#15 0x000055b165fe6233 in ExecProcNode (node=0x55b1673df350) at ./build/../src/include/executor/executor.h:239
#16 ExecutePlan (execute_once=<optimized out>, dest=0x55b1673c0060, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b1673df350, estate=0x55b1673df0f0) at ./build/../src/backend/executor/execMain.c:1646
#17 standard_ExecutorRun (queryDesc=0x55b1673b7b40, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:364
#18 0x00007fd4039810a5 in pgss_ExecutorRun (queryDesc=0x55b1673b7b40, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
#19 0x00007fd40396ecdd in pgqs_ExecutorRun (queryDesc=0x55b1673b7b40, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_qualstats.c:630
#20 0x000055b16614d16c in PortalRunSelect (portal=portal@entry=0x55b1672f3be0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x55b1673c0060) at ./build/../src/backend/tcop/pquery.c:929
#21 0x000055b16614e708 in PortalRun (portal=portal@entry=0x55b1672f3be0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x55b1673c0060, altdest=altdest@entry=0x55b1673c0060, completionTag=0x7ffcaaac6e90 "") at ./build/../src/backend/tcop/pquery.c:770
#22 0x000055b16614a4ac in exec_simple_query (query_string=0x55b16724cd90 "select count(*) from pg_qualstats;") at ./build/../src/backend/tcop/postgres.c:1215
#23 0x000055b16614bcfd in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55b1672b6f58, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4247
#24 0x000055b1660d1549 in BackendRun (port=0x55b16729dac0, port=0x55b16729dac0) at ./build/../src/backend/postmaster/postmaster.c:4437
#25 BackendStartup (port=0x55b16729dac0) at ./build/../src/backend/postmaster/postmaster.c:4128
#26 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#27 0x000055b1660d246f in PostmasterMain (argc=7, argv=0x55b167246c40) at ./build/../src/backend/postmaster/postmaster.c:1377
#28 0x000055b165e38eed in main (argc=7, argv=0x55b167246c40) at ./build/../src/backend/main/main.c:228
Detaching from program: /usr/lib/postgresql/12/bin/postgres, process 18609
[Inferior 1 (process 18609) detached]
akovac commented 4 years ago

I forgot to mention that the table is empty - this is after I reverted to 1.0.9-1:

prodpg@dbs1:~$ psql powa
psql (12.2 (Debian 12.2-2.pgdg100+1))
Type "help" for help.

powa=# select powa_take_snapshot();
 powa_take_snapshot 
--------------------

(1 row)

powa=# select count(*) from pg_qualstats;
 count 
-------
     0
(1 row)
akovac commented 4 years ago

After dropping and recreating the extension everything is fine. @seqizz Is it possible that you just pasted CREATE EXTENSION IF NOT EXISTS pg_qualstats; (without dropping the extension first)?

seqizz commented 4 years ago

After dropping and recreating the extension everything is fine. @seqizz Is it possible that you just pasted CREATE EXTENSION IF NOT EXISTS pg_qualstats; (without dropping the extension first)?

Sadly, I am sure that I reproduced the crash with the following steps on pg11 (to get the last trace I posted).

rjuju commented 4 years ago

Thanks for the extensive testing.

Yes, unfortunately pg_qualstats isn't always safe if there's a discrepancy between the loaded shared library (changing which requires a postgres restart) and the SQL object definition, that is, when the ABI changes. Once we figure out whether there's another problem with v2.0.1, I'll add safeguards for that, but unfortunately it'll only help when moving from v1.x / v2.0.1 to v2.0.2; 1.x vs 2.0.1 will still be problematic.

@seqizz those steps only ensure the version of the SQL object definition. The ABI mismatch that can cause a crash also depends on the loaded library version, so even doing a package upgrade/downgrade won't change the version loaded in postgres until you restart it. If you didn't restart postgres, this is actually a way to cause the ABI mismatch.
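So the sequence that avoids the mismatch is roughly the following (a sketch; the package commands depend on your distribution, and since there's no 1.0.9 to 2.0.1 upgrade path you need drop/create rather than ALTER EXTENSION ... UPDATE):

-- 1. upgrade the OS package (new pg_qualstats.so on disk, old one still loaded)
-- 2. restart postgres so that the new .so is actually loaded
-- 3. then, in each database using the extension:
DROP EXTENSION pg_qualstats;
CREATE EXTENSION pg_qualstats;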

seqizz commented 4 years ago

@seqizz those steps only ensure the version of the SQL object definition. The ABI mismatch that can cause a crash also depends on the loaded library version, so even doing a package upgrade/downgrade won't change the version loaded in postgres until you restart it. If you didn't restart postgres, this is actually a way to cause the ABI mismatch.

Oh.. that hurts. I forgot about the hidden ABI mismatches. I remember having a similar experience with an upgrade of repmgr.

The thing is, we try to update extensions automatically via alter extension *extension_name* update; and this is not allowed, failing with the message:

ERROR:  extension "pg_qualstats" has no update path from version "1.0.9" to version "2.0.1"

So after seeing that drop/create doesn't complain, I thought that was safe enough. But evidently not. IMHO this should also be checked/prevented at the create extension step, if possible.
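For what it's worth, the available paths can at least be checked up front before attempting the automatic update, e.g.:

-- lists which upgrade paths the installed extension scripts actually provide
SELECT source, target FROM pg_extension_update_paths('pg_qualstats') WHERE path IS NOT NULL;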

Thanks for the explanation, btw. I'll keep in mind to block the package upgrade on hosts which can't afford downtime. :+1:

rjuju commented 4 years ago

So I just pushed a fix with https://github.com/powa-team/pg_qualstats/commit/326b55508ea4797e05445d66ff3969e2013346a4, based on the pg_stat_statements implementation for API changes. All cases now work as expected, except creating the new extension definition while still having the old lib loaded, which now errors out rather than crashing later.
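The idea is the same versioned-entry-point pattern as pg_stat_statements: the new SQL script binds its functions to a versioned C symbol, so an old loaded .so that doesn't export it makes CREATE EXTENSION fail cleanly. A simplified sketch (not the actual 2.0.2 script, and the column list is only illustrative):

-- excerpt of what an extension install script does in this pattern
CREATE FUNCTION pg_qualstats(
    OUT userid oid,
    OUT dbid oid,
    OUT queryid bigint,
    OUT qualnodeid bigint
)
RETURNS SETOF record
AS 'MODULE_PATHNAME', 'pg_qualstats_2_0'  -- versioned symbol; missing from the old .so
LANGUAGE C VOLATILE;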

I also fixed pg_qualstats_names(), which was also broken. If you have the possibility, could you double-check that everything is now working as expected? If so, I'll release a new version quickly.

seqizz commented 4 years ago

Yep, looks like it's failing as expected:

ERROR:  could not find function "pg_qualstats_2_0" in file "/usr/lib/postgresql/11/lib/pg_qualstats.so"

Thanks :+1:

rjuju commented 4 years ago

Great, thanks a lot! I just pushed version 2.0.2!