Closed seqizz closed 4 years ago
Hi,
Ouch :( Did you restart postgres after upgrading the pg_qualstats package?
Does the problem still happen? Is there any chance you have a core dump somewhere and the possibility to install the debug packages?
For the record, I just tried with this environment set up from scratch:
powa=# \dx pg_qualstats
                 List of installed extensions
     Name     | Version | Schema |                  Description
--------------+---------+--------+------------------------------------------------
 pg_qualstats | 2.0.1   | public | An extension collecting statistics about quals
(1 row)
# cat /etc/debian_version
10.3
and I don't hit any issue.
Ouch :( Did you restart postgres after upgrading the pg_qualstats package?
Not actually; since it hasn't hit an important server yet and postgres keeps itself up afterwards, I'm leaving things as they are.
Does the problem still happen? Is there any chance you have a core dump somewhere and the possibility to install the debug packages?
Not sure how to get the dump; I'll check the available packages, though it will probably need a db restart. I'll try to install the lower version and upgrade again if it doesn't fail easily.
Heh, I was lucky. As soon as I've installed systemd-coredump, I got a new one. Attaching.
Thanks, that's quite helpful.
So the backtrace is:
#0 0x00005619cf6a452d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x5619d1863060,
values=values@entry=0x5619d1864290, isnull=isnull@entry=0x5619d18643a0)
at ./build/../src/backend/access/common/heaptuple.c:175
#1 0x00005619cf6a63fd in heap_form_minimal_tuple (tupleDescriptor=0x5619d1863060,
values=0x5619d1864290, isnull=0x5619d18643a0)
at ./build/../src/backend/access/common/heaptuple.c:1824
#2 0x00005619cfaeb9b0 in tuplestore_putvalues (state=0x5619d1864070, tdesc=<optimized out>,
values=<optimized out>, isnull=<optimized out>)
at ./build/../src/backend/utils/sort/tuplestore.c:756
#3 0x00007f515181f171 in pgqs_process_opexpr (expr=0x5619d18643a0, context=0x5619d1864290)
at pg_qualstats.c:1644
[...]
It looks like the problem is that the loaded version and the extension version didn't match, leading to this bad behavior. I guess that we should really handle ABI more cleanly than we do, as not restarting postgres when updating the extension version can lead to a crash.
I'll work on a new version soon that takes care of that. In the meantime, I think that now that your server has restarted, the crash shouldn't happen anymore. If that's not a production server (or if you have a similar, non-critical environment), could you confirm?
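The loaded-library vs on-disk mismatch described above can be pictured with a toy shell model (the variables and function names here are illustrative stand-ins, not real commands): the version postgres has in memory only changes on restart, no matter what apt does on disk.

```shell
# Toy model of the failure mode. upgrade_package/restart_postgres stand in
# for "apt-get install postgresql-12-pg-qualstats" and a postgres restart.
on_disk="1.0.9"     # version of pg_qualstats.so on disk
loaded="1.0.9"      # version mapped into the running postmaster

upgrade_package()  { on_disk="2.0.1"; }   # apt replaces the .so on disk only
restart_postgres() { loaded="$on_disk"; } # only a restart reloads the library

upgrade_package
echo "loaded=$loaded on_disk=$on_disk"    # mismatch: the ABI crash window
restart_postgres
echo "loaded=$loaded on_disk=$on_disk"    # consistent again
```

Dropping and recreating the extension only changes the SQL definition, never the loaded value above, which is why the restart is the essential step.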
Hi rjuju,
I'm also affected by the same issue after an upgrade.
ii postgresql-12-pg-qualstats 2.0.1-1.pgdg100+1 amd64 PostgreSQL extension to gather statistics about predicates.
The thing is that restarting postgres did not resolve the issue. I had to downgrade to
ii postgresql-12-pg-qualstats 1.0.9-1.pgdg100+1 amd64 PostgreSQL extension to gather statistics about predicates.
If more information is needed I'm willing to reproduce.
EDIT: relevant log after restart:
2020-05-06 21:26:30.989 UTC [166] LOG: database system is shut down
2020-05-06 23:26:44.047 CEST [27958] LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2020-05-06 23:26:44.047 CEST [27958] LOG: pg_stat_kcache.linux_hz is set to 1000000
2020-05-06 23:26:44.048 CEST [27958] LOG: starting PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2020-05-06 23:26:44.048 CEST [27958] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-05-06 23:26:44.048 CEST [27958] LOG: listening on IPv6 address "::", port 5432
2020-05-06 23:26:44.048 CEST [27958] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-05-06 23:26:44.231 CEST [27976] LOG: database system was shut down at 2020-05-06 23:26:30 CEST
2020-05-06 23:26:44.235 CEST [27977] zabbix@mirex FATAL: the database system is starting up
2020-05-06 23:26:44.236 CEST [27958] LOG: database system is ready to accept connections
2020-05-06 23:26:44.240 CEST [27984] LOG: POWA connected to database powa
2020-05-06 23:26:45.804 CEST [28012] repmgr@[unknown] LOG: received replication command: IDENTIFY_SYSTEM
2020-05-06 23:26:45.804 CEST [28012] repmgr@[unknown] LOG: received replication command: START_REPLICATION SLOT "repmgr_slot_3" 178/6B000000 TIMELINE 6
2020-05-06 23:31:44.615 CEST [27958] LOG: background worker " ^N)9<98>U" (PID 27984) was terminated by signal 11: Segmentation fault
2020-05-06 23:31:44.615 CEST [27958] DETAIL: Failed process was running: SELECT powa_take_snapshot()
2020-05-06 23:31:44.615 CEST [27958] LOG: terminating any other active server processes
Br
I'll work on a new version soon that takes care of that. In the meantime, I think that now that your server has restarted, the crash shouldn't happen anymore. If that's not a production server (or if you have a similar, non-critical environment), could you confirm?
Thanks for the interest and the quick reply. Sadly, I had already reverted to the old version of the plugin due to constant crashes by the time I supplied the dump :\ So I can't say anything about the restart case.
But as a rookie question: wouldn't LOG: all server processes terminated; reinitializing
imply there was already a postgres restart after each segfault? Would something change with a systemd/init-level restart?
Actually after looking at the coredump again this morning, it doesn't really make sense:
#0 0x00005619cf6a452d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x5619d1863060, values=values@entry=0x5619d1864290, isnull=isnull@entry=0x5619d18643a0)
at ./build/../src/backend/access/common/heaptuple.c:175
#1 0x00005619cf6a63fd in heap_form_minimal_tuple (tupleDescriptor=0x5619d1863060, values=0x5619d1864290, isnull=0x5619d18643a0) at ./build/../src/backend/access/common/heaptuple.c:1824
#2 0x00005619cfaeb9b0 in tuplestore_putvalues (state=0x5619d1864070, tdesc=<optimized out>, values=<optimized out>, isnull=<optimized out>) at ./build/../src/backend/utils/sort/tuplestore.c:756
#3 0x00007f515181f171 in pgqs_process_opexpr (expr=0x5619d18643a0, context=0x5619d1864290) at pg_qualstats.c:1644
#4 0x00005619cf845e0c in ExecMakeTableFunctionResult (setexpr=0x5619d15b4018, econtext=0x5619d15b3f38, argContext=<optimized out>, expectedDesc=<optimized out>, randomAccess=<optimized out>)
at ./build/../src/backend/executor/execSRF.c:231
#5 0x00005619cf8528c1 in FunctionNext (node=node@entry=0x5619d15b3e28) at ./build/../src/backend/executor/nodeFunctionscan.c:94
#6 0x00005619cf8451c9 in ExecScanFetch (recheckMtd=0x5619cf8525c0 <FunctionRecheck>, accessMtd=0x5619cf8525f0 <FunctionNext>, node=0x5619d15b3e28) at ./build/../src/backend/executor/execScan.c:95
pgqs_process_opexpr should be called in ExecutorEnd(), not in the middle of execution. Also, pgqs_process_opexpr doesn't call tuplestore_putvalues(). I'm not sure what's wrong with this coredump, but it's unfortunately not enough to understand where the problem is.
I installed the same environment with the original Debian packages, and I'm not able to reproduce the issue.
@akovac Thanks also for the report!
@akovac @seqizz Is there any chance that you can reproduce the issue and get the stacktrace while having the -dbgsym packages installed? (You can check https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD if you need some pointers.) Also, does the crash always happen when calling manually:
Ehmm.. after downgrading the package and dropping/creating the extension again, I still get crashes with the previous version. The DB was also restarted; not sure what is going on.
I am not sure if this is sufficient, but here is a stacktrace from the local dump; maybe dbgsym on the local machine helps: bt.txt
I'm also currently running postgres under GDB, but I am unable to trigger the crash with manual commands. I'll try to upgrade the plugin on this one and report back with a trace if it crashes.
@seqizz I'm fine after the downgrade - no crashes. The difference is that I didn't do a drop/create of the extension after the first upgrade or the later downgrade :-/ @rjuju I'll be able to try to reproduce a bit later, after 17:00, as the system is in production, so I'll first need to wait for some processes to finish.
Aaah I've caught it at last. Here it is: bt2.txt
A few lines were missed at the beginning, in case it's helpful:
(gdb) cont
Continuing.
Thread 2.1 "postgres" hit Catchpoint 1 (signal SIGSEGV), 0x000055555561352d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x55555620d2f0,
values=values@entry=0x55555620e520, isnull=isnull@entry=0x55555620e630) at ./build/../src/backend/access/common/heaptuple.c:175
175 ./build/../src/backend/access/common/heaptuple.c: No such file or directory.
(gdb) bt
... (rest is in the file)
I can reliably reproduce with both select powa_take_snapshot() and select count(*) from pg_qualstats
After downgrading and restarting the postgresql server, everything is fine again.
@rjuju I didn't do a drop/create of the extension (CREATE EXTENSION pg_qualstats). Is that required for a proper upgrade?
If I run that same select from pg_qualstats query on the read-only standby server, it doesn't trigger the segfault. Both servers are running inside unprivileged LXC (Debian 10 / PostgreSQL 12) without any updates pending (except the downgraded qualstats).
Backtrace from select count(*) from pg_qualstats:
#0 0x000055b165e4119d in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x55b1673fb0e8, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/access/common/heaptuple.c:138
#1 0x000055b165e42b3c in heap_form_minimal_tuple (tupleDescriptor=0x55b1673fb0e8, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/access/common/heaptuple.c:1396
#2 0x000055b1662b0a10 in tuplestore_putvalues (state=state@entry=0x55b1673fb8f8, tdesc=<optimized out>, values=values@entry=0x55b1673fbb18, isnull=isnull@entry=0x55b1673fbc28) at ./build/../src/backend/utils/sort/tuplestore.c:756
#3 0x00007fd40396f20e in pg_qualstats_common (fcinfo=<optimized out>, include_names=<optimized out>) at pg_qualstats.c:2083
#4 0x000055b165fef6b5 in ExecMakeTableFunctionResult (setexpr=0x55b1673dfcd8, econtext=0x55b1673dfbd8, argContext=<optimized out>, expectedDesc=0x55b1673ed088, randomAccess=false) at ./build/../src/backend/executor/execSRF.c:233
#5 0x000055b165ffdf3d in FunctionNext (node=node@entry=0x55b1673df560) at ./build/../src/backend/executor/nodeFunctionscan.c:94
#6 0x000055b165feec09 in ExecScanFetch (recheckMtd=0x55b165ffdc10 <FunctionRecheck>, accessMtd=0x55b165ffdc40 <FunctionNext>, node=0x55b1673df560) at ./build/../src/backend/executor/execScan.c:133
#7 ExecScan (node=0x55b1673df560, accessMtd=0x55b165ffdc40 <FunctionNext>, recheckMtd=0x55b165ffdc10 <FunctionRecheck>) at ./build/../src/backend/executor/execScan.c:183
#8 0x000055b16600368f in ExecProcNode (node=0x55b1673df560) at ./build/../src/include/executor/executor.h:239
#9 ExecHashJoinImpl (parallel=false, pstate=0x55b1673df918) at ./build/../src/backend/executor/nodeHashjoin.c:262
#10 ExecHashJoin (pstate=0x55b1673df918) at ./build/../src/backend/executor/nodeHashjoin.c:591
#11 0x000055b165ff765c in ExecProcNode (node=0x55b1673df918) at ./build/../src/include/executor/executor.h:239
#12 fetch_input_tuple (aggstate=aggstate@entry=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:406
#13 0x000055b165ff91e0 in agg_retrieve_direct (aggstate=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:1747
#14 ExecAgg (pstate=0x55b1673df350) at ./build/../src/backend/executor/nodeAgg.c:1562
#15 0x000055b165fe6233 in ExecProcNode (node=0x55b1673df350) at ./build/../src/include/executor/executor.h:239
#16 ExecutePlan (execute_once=<optimized out>, dest=0x55b1673c0060, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55b1673df350, estate=0x55b1673df0f0) at ./build/../src/backend/executor/execMain.c:1646
#17 standard_ExecutorRun (queryDesc=0x55b1673b7b40, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:364
#18 0x00007fd4039810a5 in pgss_ExecutorRun (queryDesc=0x55b1673b7b40, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
#19 0x00007fd40396ecdd in pgqs_ExecutorRun (queryDesc=0x55b1673b7b40, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_qualstats.c:630
#20 0x000055b16614d16c in PortalRunSelect (portal=portal@entry=0x55b1672f3be0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x55b1673c0060) at ./build/../src/backend/tcop/pquery.c:929
#21 0x000055b16614e708 in PortalRun (portal=portal@entry=0x55b1672f3be0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x55b1673c0060, altdest=altdest@entry=0x55b1673c0060, completionTag=0x7ffcaaac6e90 "") at ./build/../src/backend/tcop/pquery.c:770
#22 0x000055b16614a4ac in exec_simple_query (query_string=0x55b16724cd90 "select count(*) from pg_qualstats;") at ./build/../src/backend/tcop/postgres.c:1215
#23 0x000055b16614bcfd in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55b1672b6f58, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4247
#24 0x000055b1660d1549 in BackendRun (port=0x55b16729dac0, port=0x55b16729dac0) at ./build/../src/backend/postmaster/postmaster.c:4437
#25 BackendStartup (port=0x55b16729dac0) at ./build/../src/backend/postmaster/postmaster.c:4128
#26 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#27 0x000055b1660d246f in PostmasterMain (argc=7, argv=0x55b167246c40) at ./build/../src/backend/postmaster/postmaster.c:1377
#28 0x000055b165e38eed in main (argc=7, argv=0x55b167246c40) at ./build/../src/backend/main/main.c:228
Detaching from program: /usr/lib/postgresql/12/bin/postgres, process 18609
[Inferior 1 (process 18609) detached]
I forgot to mention that the table is empty; this is after I reverted to 1.0.9-1:
prodpg@dbs1:~$ psql powa
psql (12.2 (Debian 12.2-2.pgdg100+1))
Type "help" for help.
powa=# select powa_take_snapshot();
 powa_take_snapshot
--------------------
(1 row)
powa=# select count(*) from pg_qualstats;
 count
-------
     0
(1 row)
After dropping and recreating the extension, everything is fine. @seqizz Is it possible that you just executed CREATE EXTENSION IF NOT EXISTS pg_qualstats; (without dropping the extension first)?
After dropping and recreating the extension, everything is fine. @seqizz Is it possible that you just executed CREATE EXTENSION IF NOT EXISTS pg_qualstats; (without dropping the extension first)?
Sadly, I am sure that I reproduced the crash with the following steps on pg11 (to get the last trace I posted).
Thanks for the extensive testing.
Yes, unfortunately pg_qualstats isn't always safe when there's a discrepancy between the loaded shared library (changing which requires a postgres restart) and the SQL object definition, that is, when the ABI changes. Once we figure out whether there's another problem with v2.0.1, I'll add safeguards for that, but unfortunately it'll only help when moving from v1.x / v2.0.1 to v2.0.2; v1.x vs v2.0.1 will still be problematic.
@seqizz Those steps only ensure the version of the SQL object definition. The ABI mismatch that can cause a crash also depends on the loaded version, so even doing a package upgrade/downgrade won't change the version loaded in postgres until you restart it. If you didn't restart postgres, this is actually a way to cause the ABI mismatch.
@seqizz Those steps only ensure the version of the SQL object definition. The ABI mismatch that can cause a crash also depends on the loaded version, so even doing a package upgrade/downgrade won't change the version loaded in postgres until you restart it. If you didn't restart postgres, this is actually a way to cause the ABI mismatch.
Oh.. that hurts. I forgot about the hidden ABI mismatches. I remember having a similar experience with an upgrade of repmgr.
The thing is, we try to update extensions automatically via:
alter extension *extension_name* update;
And this is not allowed, with the message:
ERROR: extension "pg_qualstats" has no update path from version "1.0.9" to version "2.0.1"
So after seeing that drop/create doesn't complain, I thought that was safe enough. But evidently not. IMHO this should also be checked/prevented at the create extension step, if possible.
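For context on that "no update path" error: postgres builds ALTER EXTENSION update chains from script files named ext--A--B.sql shipped by the package. A rough shell sketch of why no chain from 1.0.9 exists (the file list is hypothetical, modeled on what a 2.0.x package might ship, not the actual package contents):

```shell
# Hypothetical script files installed by the 2.0.1 package: there is a step
# within the 2.0.x series, but nothing bridging 1.0.9 to it.
scripts="pg_qualstats--2.0.1.sql pg_qualstats--2.0.0--2.0.1.sql"

has_step() {  # has_step <from> <to> -> found/missing
    case " $scripts " in
        *" pg_qualstats--$1--$2.sql "*) echo "found" ;;
        *)                              echo "missing" ;;
    esac
}

has_step 2.0.0 2.0.1   # upgrade within 2.0.x: found
has_step 1.0.9 2.0.0   # no bridge from 1.0.9: missing -> "has no update path"
```

With no chain of scripts connecting 1.0.9 to 2.0.1, ALTER EXTENSION refuses, which is why drop/create was the only way forward.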
Thanks for the explanation btw. I'll keep in mind to block package upgrade on the hosts which can't afford downtime. :+1:
So I just pushed a fix with https://github.com/powa-team/pg_qualstats/commit/326b55508ea4797e05445d66ff3969e2013346a4, based on the pg_stat_statements implementation for API changes. All cases now work as expected, except creating the new extension definition while still having the old lib loaded, which errors out rather than crashing later.
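For the curious, the pg_stat_statements-style guard works by giving each ABI generation its own C entry point, so a stale library fails at symbol lookup with a clean error instead of crashing. A self-contained shell sketch of the idea (the symbol lists are illustrative stand-ins for the .so exports, not pg_qualstats' actual code):

```shell
# old_lib exports only the original symbol; new_lib also exports the
# versioned entry point that the new SQL definition binds to.
old_lib="pg_qualstats"
new_lib="pg_qualstats pg_qualstats_2_0"

lookup() {  # lookup <exports> <symbol> -> found/missing
    case " $1 " in
        *" $2 "*) echo "found" ;;
        *)        echo "missing" ;;
    esac
}

lookup "$old_lib" pg_qualstats_2_0   # stale lib: clean "could not find function"
lookup "$new_lib" pg_qualstats_2_0   # restarted server with new lib: works
```

The clean error path is exactly what the next comment observes: lookup of the versioned symbol in the old library fails, so the server reports an error instead of segfaulting.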
I also fixed pg_qualstats_names(), which was also broken. If you have the possibility, could you double check that everything is now working as expected? If so, I'll release a new version quickly.
Yep, looks like it's failing as expected:
ERROR: could not find function "pg_qualstats_2_0" in file "/usr/lib/postgresql/11/lib/pg_qualstats.so"
Thanks :+1:
Great, thanks a lot! I just pushed version 2.0.2!
Hi there :wave:
After an upgrade of pg_qualstats (from 1.0.9 to 2.0.1), I've started to get segfaults which cause postgres to restart while powa is taking snapshots.
Some environment info:
Error from postgres logs:
Meanwhile at kernel log:
(After the upgrade, I manually executed drop extension + create extension for pg_qualstats in the powa database to upgrade it to v2.0.1.)
I couldn't find anything specific I need to apply on upgrade; any suggestions?
Thanks in advance :+1: