lkshminarayanan closed this issue 1 year ago.
This is now reproducible locally, and the issue still occurs in 15.1.
This is a deadlock between the worker process and another process running the DROP DATABASE command.
With respect to the logs mentioned in the bug description, all the DROP DATABASE processes are waiting on the worker process (pid 24834), and that worker process itself is stuck waiting for the CREATE REPLICATION SLOT command it issued to complete.
The CREATE REPLICATION SLOT is executed by a different process (pid 24835), which is stuck waiting to acquire a lock on a particular transactionid, i.e. waiting for another transaction to complete. That transaction turned out to be the DROP DATABASE being executed by the stuck process with pid 24872.
TL;DR:
24872 -> executing DROP DATABASE and waiting for 24834 to accept the new ProcSignalBarrier generation
24834 -> sent a CREATE REPLICATION SLOT command and is waiting for it to complete
24835 -> executing the CREATE REPLICATION SLOT command but waiting for the transaction in 24872 to complete
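For reference, only the heavyweight-lock edge of this cycle (24835 waiting on 24872) is visible through the regular lock machinery; the barrier wait and the libpq wait are not lock waits. A minimal diagnostic sketch, assuming the pids from the analysis above:

-- pg_blocking_pids() reports only heavyweight-lock waits, so this should show
-- 24835 blocked by 24872 and an empty array for the other two processes.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE pid IN (24872, 24834, 24835);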
The wait done by process 24834 is a bit problematic IMO:
The process uses libpqrcv_PQexec to send the CREATE REPLICATION SLOT command. The libpqrcv_PQexec call is supposed to be non-blocking according to the comments there, but it is not able to accept the new ProcSignalBarrier interrupt while it is waiting for the CREATE REPLICATION SLOT command to complete.
The exact place where the wait happens is libpqwalreceiver.c#L696. The WaitLatchOrSocket there is supposed to handle interrupts, so it should have accepted the new ProcSignalBarrier, but that never happens.
Relevant stacktrace of this process:
(gdb) bt
#0 0x00007f1c8da39057 in epoll_wait () from /usr/lib/libc.so.6
#1 0x000055a2d0639a17 in WaitEventSetWaitBlock (set=0x55a2d233a8d0, cur_timeout=-1, occurred_events=0x7fff16d198b0, nevents=1)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:1489
#2 0x000055a2d06397c7 in WaitEventSetWait (set=0x55a2d233a8d0, timeout=-1, occurred_events=0x7fff16d198b0, nevents=1, wait_event_info=100663300)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:1435
#3 0x000055a2d0638343 in WaitLatchOrSocket (latch=0x7f1c8cefe2c4, wakeEvents=35, sock=6, timeout=-1, wait_event_info=100663300)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:562
#4 0x00007f1c8e09c1df in libpqrcv_PQgetResult (streamConn=0x55a2d233c1a0)
at /postgres/sources/15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:708
#5 0x00007f1c8e09c02b in libpqrcv_PQexec (streamConn=0x55a2d233c1a0,
query=0x55a2d233ac18 "CREATE_REPLICATION_SLOT \"pg_255220_sync_222763_7169288270839769474\" LOGICAL pgoutput (SNAPSHOT 'use')")
at /postgres/sources/15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:668
#6 0x00007f1c8e09d0da in libpqrcv_create_slot (conn=0x55a2d228cd50, slotname=0x55a2d228ccf8 "pg_255220_sync_222763_7169288270839769474", temporary=false, two_phase=false,
snapshot_action=CRS_USE_SNAPSHOT, lsn=0x7fff16d19ba8) at /postgres/sources/15.1/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c:944
#7 0x000055a2d059de71 in LogicalRepSyncTableStart (origin_startpos=0x7fff16d19ba8) at /postgres/sources/15.1/src/backend/replication/logical/tablesync.c:1353
#8 0x000055a2d05a9cb0 in start_table_sync (origin_startpos=0x7fff16d19ba8, myslotname=0x7fff16d19bb0)
at /postgres/sources/15.1/src/backend/replication/logical/worker.c:3525
#9 0x000055a2d05aa712 in ApplyWorkerMain (main_arg=1) at /postgres/sources/15.1/src/backend/replication/logical/worker.c:3688
#10 0x000055a2d0525cb1 in StartBackgroundWorker () at /postgres/sources/15.1/src/backend/postmaster/bgworker.c:858
#11 0x000055a2d053ad6f in do_start_bgworker (rw=0x55a2d2279d50) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:5823
#12 0x000055a2d053b50b in maybe_start_bgworkers () at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:6047
#13 0x000055a2d05392ae in sigusr1_handler (postgres_signal_arg=10) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:5204
#14 <signal handler called>
#15 0x00007f1c8da2e804 in select () from /usr/lib/libc.so.6
#16 0x000055a2d05300f4 in ServerLoop () at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:1770
#17 0x000055a2d052f570 in PostmasterMain (argc=8, argv=0x55a2d21d64f0) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:1478
#18 0x000055a2d037b1bc in main (argc=8, argv=0x55a2d21d64f0) at /postgres/sources/15.1/src/backend/main/main.c:202
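As a side note, the wait in frame #3 above should also be observable from SQL; a sketch, assuming PG15 wait event naming (the worker sits in a Client-type wait rather than absorbing the IPC barrier):

-- The stuck tablesync worker (pid 24834 in this analysis) is expected to show
-- something like a Client / LibPQWalReceiverReceive wait while it is blocked
-- in WaitLatchOrSocket.
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid = 24834;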
The place where the process executing CREATE REPLICATION SLOT (pid 24835) waits for another transaction to complete is the SnapBuildWaitSnapshot function. The function waits for all transactions with transactionid < cutoff to complete; in this case, it waits for the DROP DATABASE transaction.
Relevant stacktrace of this wait:
(gdb) bt
#0 0x00007f1c8da39057 in epoll_wait () from /usr/lib/libc.so.6
#1 0x000055a2d0639a17 in WaitEventSetWaitBlock (set=0x55a2d21d7638, cur_timeout=-1, occurred_events=0x7fff16d1a0b0, nevents=1)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:1489
#2 0x000055a2d06397c7 in WaitEventSetWait (set=0x55a2d21d7638, timeout=-1, occurred_events=0x7fff16d1a0b0, nevents=1, wait_event_info=50331653)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:1435
#3 0x000055a2d0638009 in WaitLatch (latch=0x7f1c8cf02e64, wakeEvents=33, timeout=0, wait_event_info=50331653)
at /postgres/sources/15.1/src/backend/storage/ipc/latch.c:497
#4 0x000055a2d06830c6 in ProcSleep (locallock=0x55a2d2268958, lockMethodTable=0x55a2d0d5aac0 <default_lockmethod>)
at /postgres/sources/15.1/src/backend/storage/lmgr/proc.c:1337
#5 0x000055a2d0663639 in WaitOnLock (locallock=0x55a2d2268958, owner=0x55a2d22a0d38) at /postgres/sources/15.1/src/backend/storage/lmgr/lock.c:1859
#6 0x000055a2d0661389 in LockAcquireExtended (locktag=0x7fff16d1a4b0, lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true, locallockp=0x0)
at /postgres/sources/15.1/src/backend/storage/lmgr/lock.c:1101
#7 0x000055a2d065ffa2 in LockAcquire (locktag=0x7fff16d1a4b0, lockmode=5, sessionLock=false, dontWait=false)
at /postgres/sources/15.1/src/backend/storage/lmgr/lock.c:752
#8 0x000055a2d065d9b1 in XactLockTableWait (xid=19277, rel=0x0, ctid=0x0, oper=XLTW_None) at /postgres/sources/15.1/src/backend/storage/lmgr/lmgr.c:702
#9 0x000055a2d0596c89 in SnapBuildWaitSnapshot (running=0x55a2d23b24f8, cutoff=19278) at /postgres/sources/15.1/src/backend/replication/logical/snapbuild.c:1511
#10 0x000055a2d05969bb in SnapBuildFindSnapshot (builder=0x55a2d23ac550, lsn=3454445384, running=0x55a2d23b24f8)
at /postgres/sources/15.1/src/backend/replication/logical/snapbuild.c:1447
#11 0x000055a2d0595e99 in SnapBuildProcessRunningXacts (builder=0x55a2d23ac550, lsn=3454445384, running=0x55a2d23b24f8)
at /postgres/sources/15.1/src/backend/replication/logical/snapbuild.c:1198
#12 0x000055a2d056a9d5 in standby_decode (ctx=0x55a2d2369de0, buf=0x7fff16d1a620) at /postgres/sources/15.1/src/backend/replication/logical/decode.c:346
#13 0x000055a2d0569eef in LogicalDecodingProcessRecord (ctx=0x55a2d2369de0, record=0x55a2d236a238)
at /postgres/sources/15.1/src/backend/replication/logical/decode.c:119
#14 0x000055a2d0572364 in DecodingContextFindStartpoint (ctx=0x55a2d2369de0) at /postgres/sources/15.1/src/backend/replication/logical/logical.c:613
#15 0x000055a2d05cc2ba in CreateReplicationSlot (cmd=0x55a2d23687c0) at /postgres/sources/15.1/src/backend/replication/walsender.c:1152
#16 0x000055a2d05ce514 in exec_replication_command (cmd_string=0x55a2d221ce40 "CREATE_REPLICATION_SLOT \"pg_255220_sync_222763_7169288270839769474\" LOGICAL pgoutput (SNAPSHOT 'use')")
at /postgres/sources/15.1/src/backend/replication/walsender.c:1820
#17 0x000055a2d069f98e in PostgresMain (dbname=0x55a2d2287d38 "db_dist_move_chunk_1", username=0x55a2d2287d00 "cluster_super_user")
at /postgres/sources/15.1/src/backend/tcop/postgres.c:4577
#18 0x000055a2d0538ec2 in BackendRun (port=0x55a2d227a1f0) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:4504
#19 0x000055a2d0538212 in BackendStartup (port=0x55a2d227a1f0) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:4232
#20 0x000055a2d05303c1 in ServerLoop () at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:1806
#21 0x000055a2d052f570 in PostmasterMain (argc=8, argv=0x55a2d21d64f0) at /postgres/sources/15.1/src/backend/postmaster/postmaster.c:1478
#22 0x000055a2d037b1bc in main (argc=8, argv=0x55a2d21d64f0) at /postgres/sources/15.1/src/backend/main/main.c:202
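The transactionid lock wait in frames #8/#9 can also be seen in pg_locks; a sketch, assuming the xid 19277 from the stack above:

-- Expected (roughly): pid 24872 holds the lock on its own transactionid
-- (granted = t) while pid 24835 waits for a share lock on it (granted = f).
SELECT locktype, transactionid, pid, mode, granted
FROM pg_locks
WHERE locktype = 'transactionid'
ORDER BY pid;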
Excellent analysis. As discussed, this looks like a PostgreSQL internal issue. Maybe this needs to be discussed in the community as well.
The exact place where the wait happens is libpqwalreceiver.c#L696. The WaitLatchOrSocket is supposed to handle interrupts and so it should have accepted the new ProcSignalBarrier but that never happens.

WaitLatchOrSocket doesn't handle the interrupt because it is explicitly prevented from doing so; refer to the comment before walrcv_create_slot. So WaitLatchOrSocket not handling the interrupt is expected behaviour, and this hang occurs only due to an unfortunate race condition.
I tried reproducing the hang with vanilla postgres by creating a subscription and dropping a database, but I was never able to get the timing right. Still, from code inspection alone, one can confirm the possibility of this deadlock.
The following SQL statements are equivalent to what dist_move_chunk and another testcase execute to get into a hang:
Run the following in the publisher node:
CREATE TABLE t1 (c1 int);
INSERT INTO t1 VALUES (1), (2);
CREATE PUBLICATION pb_sb FOR TABLE t1;
SELECT pg_create_logical_replication_slot('pb_sb', 'pgoutput');
Run the following in the subscriber node:
CREATE TABLE t1 (c1 int);
CREATE DATABASE dummy;
Now execute the following two statements in parallel on the subscriber node:
CREATE SUBSCRIPTION pb_sb
CONNECTION 'host=localhost port=5432 dbname=postgres user=postgres'
PUBLICATION pb_sb WITH (create_slot=false);
and
DROP DATABASE dummy;
The CREATE SUBSCRIPTION internally executes a CREATE_REPLICATION_SLOT command, and the process executing that gets into a deadlock with the process executing DROP DATABASE.
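While the two statements above are hung, the internally issued replication command should be visible alongside the stuck DROP DATABASE; a small sketch to confirm that on the instance:

SELECT pid, application_name, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query LIKE 'CREATE_REPLICATION_SLOT%'
   OR query LIKE 'DROP DATABASE%';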
Is this related? https://github.com/timescale/timescaledb/issues/4958
@horzsolt I don't think so. The graceful exit issue is due to the usage of blocking PostgreSQL APIs in the TimescaleDB code base, whereas this issue is caused by the new ProcSignalBarrier mechanism used by DROP DATABASE in PG15.
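For what it's worth, a backend waiting in WaitForProcSignalBarrier should also be identifiable from SQL; a sketch, assuming the PG15 IPC wait event name:

-- DROP DATABASE backends stuck waiting for the barrier to be absorbed.
SELECT pid, datname, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'ProcSignalBarrier';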
Looks like this is fixed by https://github.com/timescale/timescaledb/pull/5058
Hi @erimatnor,
I ran the test locally with the changes from #5058 (commit hash 845b78096c). The test still hangs for me. The problem occurs only when dist_move_chunk is run in parallel with other tests; I use make -k regresscheck-t to run them.
Relevant log lines that keep repeating in the postmaster log:
2022-12-21 18:33:08.028 IST cluster_super_user [16445] db_dist_ddl LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.028 IST cluster_super_user [16445] db_dist_ddl LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.028 IST cluster_super_user [16445] db_dist_ddl STATEMENT: DROP DATABASE db_dist_ddl_3;
2022-12-21 18:33:08.031 IST cluster_super_user [16384] db_dist_grant LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.031 IST cluster_super_user [16384] db_dist_grant LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.031 IST cluster_super_user [16384] db_dist_grant STATEMENT: DROP DATABASE db_dist_grant_3;
2022-12-21 18:33:08.034 IST cluster_super_user [16503] db_dist_backup LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.034 IST cluster_super_user [16503] db_dist_backup LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.034 IST cluster_super_user [16503] db_dist_backup STATEMENT: DROP DATABASE db_dist_backup_2;
2022-12-21 18:33:08.036 IST cluster_super_user [16554] db_dist_copy_long LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.036 IST cluster_super_user [16554] db_dist_copy_long LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.036 IST cluster_super_user [16554] db_dist_copy_long STATEMENT: DROP DATABASE db_dist_copy_long_2;
2022-12-21 18:33:08.037 IST super_user [16184] db_dist_commands LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.037 IST super_user [16184] db_dist_commands LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.037 IST super_user [16184] db_dist_commands STATEMENT: DROP DATABASE db_dist_commands_3;
2022-12-21 18:33:08.262 IST cluster_super_user [16559] db_dist_copy_available_dns LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.262 IST cluster_super_user [16559] db_dist_copy_available_dns LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.262 IST cluster_super_user [16559] db_dist_copy_available_dns STATEMENT: DROP DATABASE db_dist_copy_available_dns_2;
2022-12-21 18:33:08.270 IST cluster_super_user [16472] db_dist_copy_format_long LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.270 IST cluster_super_user [16472] db_dist_copy_format_long LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.270 IST cluster_super_user [16472] db_dist_copy_format_long STATEMENT: DROP DATABASE db_dist_copy_format_long_3;
2022-12-21 18:33:08.331 IST super_user [19007] postgres LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.331 IST super_user [19007] postgres LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.331 IST super_user [19007] postgres STATEMENT: DROP DATABASE "db_dist_move_chunk";
2022-12-21 18:33:08.341 IST cluster_super_user [16387] db_dist_cagg LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.341 IST cluster_super_user [16387] db_dist_cagg LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.341 IST cluster_super_user [16387] db_dist_cagg STATEMENT: DROP DATABASE db_dist_cagg_3;
2022-12-21 18:33:08.658 IST cluster_super_user [16393] db_dist_compression LOG: 00000: still waiting for backend with PID 16577 to accept ProcSignalBarrier
2022-12-21 18:33:08.658 IST cluster_super_user [16393] db_dist_compression LOCATION: WaitForProcSignalBarrier, procsignal.c:418
2022-12-21 18:33:08.658 IST cluster_super_user [16393] db_dist_compression STATEMENT: DROP DATABASE db_dist_compression_3;
commit_hash and pg_stat_activity output:
db_dist_move_chunk_1=# SELECT commit_hash, commit_time FROM _timescaledb_internal.get_git_commit();
commit_hash | commit_time
------------------------------------------+------------------------------
845b78096c5cf5bd55a098a66558a87a6e938338 | Fri Dec 16 07:39:02 2022 PST
(1 row)
db_dist_move_chunk_1=# select pid, datname, application_name, query from pg_stat_activity order by pid;
pid | datname | application_name | query
-------+------------------------------+-----------------------------------------+---------------------------------------------------------------------------------------------------------
12099 | | |
12100 | | |
12102 | | |
12103 | | TimescaleDB Background Worker Launcher |
12104 | | |
15903 | db_dist_param_1 | TimescaleDB Background Worker Scheduler |
15954 | db_dist_move_chunk_1 | TimescaleDB Background Worker Scheduler |
16024 | db_dist_move_chunk_2 | TimescaleDB Background Worker Scheduler |
16080 | db_dist_copy_long_3 | TimescaleDB Background Worker Scheduler |
16184 | db_dist_commands | pg_regress/dist_commands | DROP DATABASE db_dist_commands_3;
16384 | db_dist_grant | pg_regress/dist_grant-15 | DROP DATABASE db_dist_grant_3;
16387 | db_dist_cagg | pg_regress/dist_cagg | DROP DATABASE db_dist_cagg_3;
16393 | db_dist_compression | pg_regress/dist_compression | DROP DATABASE db_dist_compression_3;
16394 | db_dist_move_chunk_3 | TimescaleDB Background Worker Scheduler |
16445 | db_dist_ddl | pg_regress/dist_ddl | DROP DATABASE db_dist_ddl_3;
16472 | db_dist_copy_format_long | pg_regress/dist_copy_format_long | DROP DATABASE db_dist_copy_format_long_3;
16503 | db_dist_backup | pg_regress/dist_backup | DROP DATABASE db_dist_backup_2;
16506 | db_dist_backup_3 | timescaledb | SELECT count(*) FROM pg_prepared_statements
16521 | db_dist_grant_4 | TimescaleDB Background Worker Scheduler |
16545 | db_dist_copy_available_dns_3 | TimescaleDB Background Worker Scheduler |
16554 | db_dist_copy_long | pg_regress/dist_copy_long | DROP DATABASE db_dist_copy_long_2;
16559 | db_dist_copy_available_dns | pg_regress/dist_copy_available_dns | DROP DATABASE db_dist_copy_available_dns_2;
16577 | db_dist_move_chunk_2 | |
16578 | db_dist_move_chunk_1 | pg_regress | CREATE_REPLICATION_SLOT "pg_303875_sync_246688_7179520409272532785" LOGICAL pgoutput (SNAPSHOT 'use')
19007 | postgres | pg_regress/dist_move_chunk | DROP DATABASE "db_dist_move_chunk";
24462 | db_dist_move_chunk_2 | |
24466 | db_dist_move_chunk_1 | pg_regress | START_REPLICATION SLOT "ts_copy_1_1" LOGICAL 0/0 (proto_version '3', publication_names '"ts_copy_1_1"')
26751 | postgres | TimescaleDB Background Worker Scheduler |
27156 | db_dist_move_chunk_1 | psql | select pid, datname, application_name, query from pg_stat_activity order by pid;
(29 rows)
I still think this is not a TimescaleDB issue but rather a PostgreSQL upstream issue. There is a deadlock between a worker process that is trying to create a replication slot and another process that is executing DROP DATABASE.
I was able to reproduce the issue in vanilla postgres with the following queries:
-- Setup publisher database and tables
CREATE DATABASE db_publisher;
\c db_publisher
CREATE TABLE test_tab (a int PRIMARY KEY, b varchar);
INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar');
-- Create publication
CREATE PUBLICATION test_publication FOR TABLE test_tab;
SELECT pg_create_logical_replication_slot('test_subscription', 'pgoutput');
-- Create dummy database that is to be dropped at last
CREATE DATABASE dummy;
-- Setup subscriber database and table
CREATE DATABASE db_subscriber;
\c db_subscriber
CREATE TABLE test_tab (a int PRIMARY KEY, b varchar);
-- Create the subscription
CREATE SUBSCRIPTION test_subscription
CONNECTION 'dbname=db_publisher user=postgres'
PUBLICATION test_publication WITH (create_slot=false);
-- Drop the dummy database
-- This will hang due to a deadlock between the logical replication workers and the process executing the DROP
DROP DATABASE dummy;
This confirms that the issue is not TimescaleDB-specific. I have sent a bug report upstream with all the details.
With respect to TimescaleDB, this is not a serious issue, as the bug occurs only if the two data nodes are running in the same postgres instance, which is the case in dist_move_chunk and other testcases but is not common practice in production.
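Until the upstream fix is available, a possible manual workaround (a sketch, not a proper fix) is to terminate the stuck worker that the barrier is waiting on, i.e. the PID named in the "still waiting for backend with PID ..." log lines; the logical replication launcher restarts the worker, and the pending DROP DATABASE commands should then be able to proceed:

-- Hypothetical example: 16577 is the stuck worker PID from the logs above;
-- replace it with the actual PID. The race could in principle recur.
SELECT pg_terminate_backend(16577);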
This is now fixed: postgres/postgres@6c6d6ba. The fix will probably be included in PG15.2.
What type of bug is this?
Other
What subsystems and features are affected?
Data node, Distributed hypertable
What happened?
When compiled and run on PG15, the dist_move_chunk testcase hangs due to a deadlock between the copy_chunk function and a DROP DATABASE command run in parallel. This is not locally reproducible but hangs in the GitHub CI.
TimescaleDB version affected
2.9.0
PostgreSQL version used
15.0
What operating system did you use?
Ubuntu 22.04
What installation method did you use?
Source
What platform did you run on?
Not applicable
Relevant log output and stack trace
How can we reproduce the bug?