Race condition between neon_read() and StartupXLOG()/redo loop leads to deadlock #9955

Open alexanderlaw opened 1 day ago

alexanderlaw commented 1 day ago

The test_replica_query_race test sometimes fails with a statement timeout, e.g.: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9863/11991160666/index.html#/testresult/abae5a0a14db6afe https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9861/11978735655/index.html#/testresult/8dac1a69385b2308

asyncpg.exceptions.QueryCanceledError: canceling statement due to statement timeout

This issue can be reproduced just by running the test in a loop (running 10 instances in parallel speeds up the reproduction for me). I compiled the binaries with CFLAGS="-O0 -DWAL_DEBUG", added

config_lines = ["wal_debug = on", "log_min_messages = DEBUG3", "log_min_error_statement = log", "log_connections = on", "log_disconnections = on", "log_line_prefix = '%m [%p:%l] %q%a '", "log_statement = 'all'"]

to the test endpoints' configuration, and ran:

DEFAULT_PG_VERSION=14 BUILD_TYPE=release USE_LFC=false poetry run pytest test_runner/regress/test_hot_standby.py::test_replica_query_race[release-pg14] 

With the statement timeout and the test timeout increased, I can see the following:

ITERATION 142
Sun Dec  1 08:55:00 UTC 2024
8       ======================== 1 passed, 1 warning in 18.54s =========================
10      ======================== 1 passed, 1 warning in 18.53s =========================
4       ======================== 1 passed, 1 warning in 19.04s =========================
3       ======================== 1 passed, 1 warning in 19.04s =========================
6       ======================== 1 passed, 1 warning in 19.04s =========================
1       ======================== 1 passed, 1 warning in 19.54s =========================
7       ======================== 1 passed, 1 warning in 19.54s =========================
5       ======================== 1 passed, 1 warning in 19.54s =========================
9       ======================== 1 passed, 1 warning in 20.04s =========================
2       ======================== 1 passed, 1 warning in 20.04s =========================
ITERATION 143
Sun Dec  1 08:55:21 UTC 2024
... hanging ...

test_output/test_replica_query_race_7[release-pg14]/repo/endpoints/standby/compute.log contains (with my comments added):

PG:2024-12-01 08:55:41.997 GMT [3694308:18687] LOG:  REDO @ 0/1810AB8; LSN 0/1810AE8: prev 0/1810A68; xid 9869; len 20 - Transaction/COMMIT: 2024-12-01 08:55:41.948542+00
PG:2024-12-01 08:55:41.997 GMT [3694308:18688] LOG:  REDO @ 0/1810AE8; LSN 0/1810CB8: prev 0/1810AB8; xid 0; len 8; blkref #0: rel 1663/13014/16397, blk 0 - Heap2/PRUNE: latestRemovedXid 9869 nredirected 1 ndead 0
PG:2024-12-01 08:55:41.997 GMT [3694314:28601] DEBUG:  sending write 0/18120D0 flush 0/18120D0 apply 0/1810AE8
PG:2024-12-01 08:55:41.997 GMT [3694308:18689] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/1810AE8, not_modified_since 0/16E33E8
PG:2024-12-01 08:55:41.997 GMT [3694335:6609] [unknown] LOG:  statement: SELECT clear_buffer_cache()
PG:2024-12-01 08:55:41.997 GMT [3694335:6610] [unknown] DEBUG:  updated min recovery point to 0/1810CB8 on timeline 1
PG:2024-12-01 08:55:41.997 GMT [3694335:6611] [unknown] CONTEXT:  writing block 0 of relation base/13014/16397
PG:2024-12-01 08:55:41.998 GMT [3694335:6612] [unknown] DEBUG:  bind <unnamed> to __asyncpg_stmt_1__
PG:2024-12-01 08:55:41.998 GMT [3694335:6613] [unknown] LOG:  execute __asyncpg_stmt_1__: SELECT counter FROM test
    ### standby_workload from the test executes a query
PG:2024-12-01 08:55:41.998 GMT [3694335:6614] [unknown] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/1810CB8, not_modified_since 0/1810CB8
    ### the backend executing the query wants to get the page it needs at LSN 0/1810CB8
PG:2024-12-01 08:55:41.998 GMT [3694308:18690] LOG:  REDO @ 0/1810CB8; LSN 0/1810D08: prev 0/1810AE8; xid 9870; len 18; blkref #0: rel 1663/13014/16397, blk 0 - Heap/HOT_UPDATE: off 205 xmax 9870 flags 0x10 ; new off 2 xmax 0
    ### this LSN is replayed just after that, inside StartupXLOG(); see the corresponding code below
PG:2024-12-01 08:55:41.998 GMT [3694314:28602] DEBUG:  sendtime 2024-12-01 08:55:41.998248+00 receipttime 2024-12-01 08:55:41.998347+00 replication apply delay 61 ms transfer latency 1 ms
PG:2024-12-01 08:55:41.998 GMT [3694314:28603] DEBUG:  sending write 0/1812150 flush 0/18120D0 apply 0/1810CB8
PG:2024-12-01 08:55:41.998 GMT [3694314:28604] DEBUG:  sending write 0/1812150 flush 0/1812150 apply 0/1810CB8
    ### now these two processes are waiting for each other
...
PG:2024-12-01 08:55:52.008 GMT [3694335:6615] [unknown] LOG:  Waiting for recovery to catch up to 0/1810CB8
PG:2024-12-01 08:55:52.008 GMT [3694335:6616] [unknown] STATEMENT:  SELECT counter FROM test
    ### the backend executing the query logs this message once every 10 seconds, inside XLogWaitForReplayOf()
...
PG:2024-12-01 08:56:02.009 GMT [3694335:6617] [unknown] LOG:  Waiting for recovery to catch up to 0/1810CB8
PG:2024-12-01 08:56:02.009 GMT [3694335:6618] [unknown] STATEMENT:  SELECT counter FROM test
...

Running processes of the standby instance:

law      3694300 3694266  0 08:55 pts/3    00:00:00 .../pg_install/v14/bin/postgres -D .../test_output/test_replica_query_race_7[release-pg14]/repo/endpoints/standby/pgdata
law      3694308 3694300  0 08:55 ?        00:00:00 postgres: startup recovering 000000010000000000000001
law      3694311 3694300  0 08:55 ?        00:00:00 postgres: checkpointer 
law      3694312 3694300  0 08:55 ?        00:00:00 postgres: background writer 
law      3694313 3694300  0 08:55 ?        00:00:00 postgres: stats collector 
law      3694314 3694300  0 08:55 ?        00:00:01 postgres: walreceiver streaming 0/1854F28
law      3694315 3694300  0 08:55 ?        00:00:03 postgres: cloud_admin postgres 127.0.0.1(44306) idle
law      3694335 3694300  0 08:55 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(44360) SELECT
law      3694336 3694081  0 08:55 ?        00:00:01 postgres: cloud_admin postgres 127.0.0.1(51892) idle
law      3694785 3691682  0 09:04 ?        00:00:00 postgres: law storage_controller ::1(36894) idle

gdb sessions show:

Attaching to process 3694335

(gdb) bt
#0  0x00007a8e0672a007 in epoll_wait (epfd=10, events=0x6350740e8738, maxevents=1, timeout=10000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00006350721c5012 in WaitEventSetWaitBlock (set=0x6350740e86d8, cur_timeout=10000, occurred_events=0x7fff293671f0, nevents=1) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1477
#2  0x00006350721c4eb0 in WaitEventSetWait (set=0x6350740e86d8, timeout=10000, occurred_events=0x7fff293671f0, nevents=1, wait_event_info=83886088)
    at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1423
#3  0x00006350721c42b4 in WaitLatch (latch=0x7a8e04f5cc84, wakeEvents=41, timeout=10000, wait_event_info=83886088) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:494
#4  0x00006350721d4bba in ConditionVariableTimedSleep (cv=0x7a8e041045ec, timeout=10000, wait_event_info=83886088) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:163
#5  0x0000635071df12d0 in XLogWaitForReplayOf (redoEndRecPtr=25234616) at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:5392
#6  0x00007a8e06f67486 in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=0, request_lsns=0x7fff293673e0, buffers=0x7fff293673b0, nblocks=1, mask=0x0) at .../pgxn/neon/pagestore_smgr.c:2832
#7  0x00007a8e06f678bf in neon_read_at_lsn (rinfo=..., forkNum=MAIN_FORKNUM, blkno=0, request_lsns=..., buffer=0x7a8e04b20d80 "") at .../pgxn/neon/pagestore_smgr.c:2950
#8  0x00007a8e06f67aed in neon_read (reln=0x6350741f2d98, forkNum=MAIN_FORKNUM, blkno=0, buffer=0x7a8e04b20d80 "") at .../pgxn/neon/pagestore_smgr.c:2992
#9  0x00006350721f1058 in smgrread (reln=0x6350741f2d98, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7a8e04b20d80 "") at .../vendor/postgres-v14/src/backend/storage/smgr/smgr.c:515
#10 0x00006350721ade78 in ReadBuffer_common (smgr=0x6350741f2d98, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff293675d3)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1023
#11 0x00006350721ad568 in ReadBufferExtended (reln=0x7a8e040d5948, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:776
#12 0x0000635071d67c8c in heapgetpage (sscan=0x63507420c760, page=0) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:533
#13 0x0000635071d68f37 in heapgettup_pagemode (scan=0x63507420c760, dir=ForwardScanDirection, nkeys=0, key=0x0) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:1032
#14 0x0000635071d6a090 in heap_getnextslot (sscan=0x63507420c760, direction=ForwardScanDirection, slot=0x63507420c160) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:1529
#15 0x0000635071ffc3a8 in table_scan_getnextslot (sscan=0x63507420c760, direction=ForwardScanDirection, slot=0x63507420c160) at .../vendor/postgres-v14/src/include/access/tableam.h:1044
#16 0x0000635071ffc474 in SeqNext (node=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/nodeSeqscan.c:80
#17 0x0000635071fc1591 in ExecScanFetch (node=0x63507420bf70, accessMtd=0x635071ffc3dc <SeqNext>, recheckMtd=0x635071ffc485 <SeqRecheck>) at .../vendor/postgres-v14/src/backend/executor/execScan.c:132
#18 0x0000635071fc160a in ExecScan (node=0x63507420bf70, accessMtd=0x635071ffc3dc <SeqNext>, recheckMtd=0x635071ffc485 <SeqRecheck>) at .../vendor/postgres-v14/src/backend/executor/execScan.c:181
#19 0x0000635071ffc4d1 in ExecSeqScan (pstate=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/nodeSeqscan.c:112
#20 0x0000635071fbd6f7 in ExecProcNodeFirst (node=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/execProcnode.c:464
#21 0x0000635071fb2326 in ExecProcNode (node=0x63507420bf70) at .../vendor/postgres-v14/src/include/executor/executor.h:260
#22 0x0000635071fb498e in ExecutePlan (estate=0x63507420bd48, planstate=0x63507420bf70, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=1, direction=ForwardScanDirection, dest=0x63507412e3b8, 
    execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:1560
#23 0x0000635071fb28b6 in standard_ExecutorRun (queryDesc=0x635074216e28, direction=ForwardScanDirection, count=1, execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:364
#24 0x0000635071fb273d in ExecutorRun (queryDesc=0x635074216e28, direction=ForwardScanDirection, count=1, execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:308
#25 0x00006350721fb9dc in PortalRunSelect (portal=0x6350741bf7a8, forward=true, count=1, dest=0x63507412e3b8) at .../vendor/postgres-v14/src/backend/tcop/pquery.c:921
#26 0x00006350721fb644 in PortalRun (portal=0x6350741bf7a8, count=1, isTopLevel=true, run_once=false, dest=0x63507412e3b8, altdest=0x63507412e3b8, qc=0x7fff29367c70)
    at .../vendor/postgres-v14/src/backend/tcop/pquery.c:765
#27 0x00006350721f6391 in exec_execute_message (portal_name=0x63507412df48 "", max_rows=1) at .../vendor/postgres-v14/src/backend/tcop/postgres.c:2217
#28 0x00006350721f991c in PostgresMain (argc=1, argv=0x7fff29367ec0, dbname=0x635074189990 "postgres", username=0x635074189970 "cloud_admin") at .../vendor/postgres-v14/src/backend/tcop/postgres.c:4672
#29 0x00006350721367e9 in BackendRun (port=0x635074181540) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:4541
#30 0x0000635072136044 in BackendStartup (port=0x635074181540) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:4263
#31 0x0000635072132068 in ServerLoop () at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1748
#32 0x00006350721317d2 in PostmasterMain (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1420
#33 0x00006350720336ec in main (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/main/main.c:249

(gdb) f 5
#5  0x0000635071df12d0 in XLogWaitForReplayOf (redoEndRecPtr=25234616) at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:5392
5392                    timeout = ConditionVariableTimedSleep(&XLogCtl->replayProgressCV,

(gdb) p/x redoEndRecPtr
$1 = 0x1810cb8

(gdb) f 10
#10 0x00006350721ade78 in ReadBuffer_common (smgr=0x6350741f2d98, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff293675d3)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1023
1023                            smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
(gdb) p *bufHdr
$2 = {tag = {rnode = {spcNode = 1663, dbNode = 13014, relNode = 16397}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 62, state = {value = 2248671234}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 57, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
Attaching to process 3694308

(gdb) bt
#0  0x00007a8e0672a007 in epoll_wait (epfd=9, events=0x6350740e8738, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00006350721c5012 in WaitEventSetWaitBlock (set=0x6350740e86d8, cur_timeout=-1, occurred_events=0x7fff29364de0, nevents=1) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1477
#2  0x00006350721c4eb0 in WaitEventSetWait (set=0x6350740e86d8, timeout=-1, occurred_events=0x7fff29364de0, nevents=1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1423
#3  0x00006350721c42b4 in WaitLatch (latch=0x7a8e04f61f04, wakeEvents=33, timeout=-1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:494
#4  0x00006350721d4bba in ConditionVariableTimedSleep (cv=0x7a8e04ba5160, timeout=-1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:163
#5  0x00006350721d4b19 in ConditionVariableSleep (cv=0x7a8e04ba5160, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:100
#6  0x00006350721b2e4c in WaitIO (buf=0x7a8e04aa3d00) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4514
#7  0x00006350721b2eaf in StartBufferIO (buf=0x7a8e04aa3d00, forInput=true) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4551
#8  0x00006350721ae3a7 in BufferAlloc (smgr=0x6350741c87e8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, strategy=0x0, foundPtr=0x7fff29364ff1)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1173
#9  0x00006350721ad8aa in ReadBuffer_common (smgr=0x6350741c87e8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff2936512f)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:872
#10 0x00006350721ad639 in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:804
#11 0x0000635071e094be in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=0, mode=RBM_NORMAL) at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:483
#12 0x0000635071e09322 in XLogReadBufferForRedoExtended (record=0x63507418e228, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7fff2936526c)
    at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:410
#13 0x0000635071e08f30 in XLogReadBufferForRedo (record=0x63507418e228, block_id=0 '\000', buf=0x7fff2936526c) at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:297
#14 0x0000635071d77f97 in heap_xlog_update (record=0x63507418e228, hot_update=true) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:9832
#15 0x0000635071d796ca in heap_redo (record=0x63507418e228) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:10271
#16 0x0000635071df6063 in StartupXLOG () at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:7830
#17 0x00006350721380d3 in StartupProcessMain () at .../vendor/postgres-v14/src/backend/postmaster/startup.c:259
#18 0x0000635071e0e582 in AuxiliaryProcessMain (argc=2, argv=0x7fff29367ff0) at .../vendor/postgres-v14/src/backend/bootstrap/bootstrap.c:458
#19 0x0000635072136cfb in StartChildProcess (type=StartupProcess) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:5485
#20 0x00006350721317ae in PostmasterMain (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1412
#21 0x00006350720336ec in main (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/main/main.c:249

(gdb) f 7
#7  0x00006350721b2eaf in StartBufferIO (buf=0x7a8e04aa3d00, forInput=true) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4551
4551                    WaitIO(buf);
(gdb) p *buf
$1 = {tag = {rnode = {spcNode = 1663, dbNode = 13014, relNode = 16397}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 62, state = {value = 2248671234}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 57, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}

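Both dumps show the same buffer (rel 1663/13014/16397, block 0, buf_id 62) with state = 2248671234 = 0x86080002. Assuming the state-word layout matches stock PostgreSQL 14's buf_internals.h, that decodes to refcount 2, usage count 2, and flags BM_TAG_VALID | BM_IO_IN_PROGRESS | BM_PERMANENT (BM_VALID is not set), i.e. both processes have the buffer pinned and a read into it is still marked as in progress. A tiny standalone decoder (flag values copied from PostgreSQL 14 and assumed unchanged in the Neon fork):

    /*
     * Decode the BufferDesc state word from the gdb dumps above.  Flag values
     * are copied from PostgreSQL 14's src/include/storage/buf_internals.h.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define BUF_REFCOUNT_MASK     ((1U << 18) - 1)
    #define BUF_USAGECOUNT_MASK   0x003C0000U
    #define BUF_USAGECOUNT_SHIFT  18
    #define BM_DIRTY              (1U << 23)
    #define BM_VALID              (1U << 24)
    #define BM_TAG_VALID          (1U << 25)
    #define BM_IO_IN_PROGRESS     (1U << 26)
    #define BM_PERMANENT          (1U << 31)

    int
    main(void)
    {
        uint32_t state = 2248671234U;   /* 0x86080002, from both gdb sessions */

        printf("refcount=%u usage_count=%u\n",
               state & BUF_REFCOUNT_MASK,
               (state & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT);
        if (state & BM_VALID)          printf("BM_VALID\n");
        if (state & BM_TAG_VALID)      printf("BM_TAG_VALID\n");
        if (state & BM_DIRTY)          printf("BM_DIRTY\n");
        if (state & BM_IO_IN_PROGRESS) printf("BM_IO_IN_PROGRESS\n");
        if (state & BM_PERMANENT)      printf("BM_PERMANENT\n");
        return 0;
    }
    /* Prints: refcount=2 usage_count=2, BM_TAG_VALID, BM_IO_IN_PROGRESS, BM_PERMANENT */
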
That is, the client backend called ReadBuffer_common, where we have the following:

        /*
         * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
         * not currently in memory.
         */
        bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                             strategy, &found);

So it set IO_IN_PROGRESS for buffer 16397/0, then called smgrread() -> neon_read() and chose to wait for LSN 0/1810CB8 (sleeping on replayProgressCV), which had not been replayed at that moment.
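
Frame #5 of the first backtrace is the wait loop of XLogWaitForReplayOf(), a function added in the Neon fork of xlog.c. A paraphrased sketch of that loop, reconstructed from the backtrace, the gdb listing at line 5392, and the 10-second "Waiting for recovery to catch up" messages (the real code may differ in details):

    void
    XLogWaitForReplayOf(XLogRecPtr redoEndRecPtr)
    {
        ConditionVariablePrepareToSleep(&XLogCtl->replayProgressCV);
        /* Wait until the startup process has replayed up to redoEndRecPtr. */
        while (GetXLogReplayRecPtr(NULL) < redoEndRecPtr)
        {
            /*
             * Sleep for at most 10 s; the startup process broadcasts
             * replayProgressCV after applying each WAL record (see the redo
             * loop quoted below).
             */
            if (ConditionVariableTimedSleep(&XLogCtl->replayProgressCV,
                                            10000 /* ms */,
                                            WAIT_EVENT_RECOVERY_WAL_STREAM))
                elog(LOG, "Waiting for recovery to catch up to %X/%X",
                     LSN_FORMAT_ARGS(redoEndRecPtr));
        }
        ConditionVariableCancelSleep();
    }

The only thing that advances GetXLogReplayRecPtr() and broadcasts replayProgressCV is the startup process's redo loop, which is exactly what gets blocked next.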

Then the recovery process reads the needed record (REDO @ 0/1810CB8), logs a message about it, and calls heap_redo(), which tries to apply changes to the same buffer, which is marked as IO_IN_PROGRESS by the client backend. The corresponding StartupXLOG() code:

            /*
             * main redo apply loop
             */
            do
            {
...
                {
...
                    appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
                                     LSN_FORMAT_ARGS(ReadRecPtr),
                                     LSN_FORMAT_ARGS(EndRecPtr));
...
                }
...
                /* Now apply the WAL record itself */
                RmgrTable[record->xl_rmid].rm_redo(xlogreader);
...
                ConditionVariableBroadcast(&XLogCtl->replayProgressCV);
...

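Why the redo loop cannot finish applying that record: XLogReadBufferForRedo() goes through BufferAlloc(), whose StartBufferIO() finds BM_IO_IN_PROGRESS already set by the client backend and therefore calls WaitIO() (frames #6-#8 of the second backtrace). Roughly as it appears in PostgreSQL 14's bufmgr.c, that wait only ends once the flag is cleared:

    static void
    WaitIO(BufferDesc *buf)
    {
        ConditionVariable *cv = BufferDescriptorGetIOCV(buf);

        ConditionVariablePrepareToSleep(cv);
        for (;;)
        {
            uint32      buf_state;

            /* Re-check the flag under the buffer header spinlock. */
            buf_state = LockBufHdr(buf);
            UnlockBufHdr(buf, buf_state);

            if (!(buf_state & BM_IO_IN_PROGRESS))
                break;
            ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
        }
        ConditionVariableCancelSleep();
    }

BM_IO_IN_PROGRESS is only cleared (in TerminateBufferIO()) once the client backend's smgrread() returns, and that read is waiting for replay progress that only this startup process can make, so neither process can ever proceed.
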
Thus, test_replica_query_race is flaky because of yet another race condition.

knizhnik commented 1 day ago

XLogWaitForReplayOf was added by @MMeent for replica support from the very beginning. The motivation is obvious: a replica should not request pages more recent than what it has applied at this moment. The question is how a replica can ever request a page with an LSN larger than GetXLogReplayRecPtr(). The request LSN is taken from the last-written-LSN cache, so how can we end up with an LSN larger than the replay LSN? It looks like it should not be possible, but the stack traces above show that it actually happens. lastReplayedEndRecPtr is advanced after a WAL record has been applied, and a WAL record can update multiple pages. If one of those pages is evicted from shared buffers, its lwlsn will be larger than lastReplayedEndRecPtr. If that page is then requested again, we get this situation where the request LSN is larger than the apply LSN.
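
A possible interleaving matching that description (purely illustrative; the record, blocks, and ordering are hypothetical, not taken from the logs above):

    /*
     * WAL record R spans [startLSN, endLSN) and touches blocks A and B.
     *
     *   startup:  rm_redo(R) modifies block A; A's page LSN becomes endLSN
     *   backend:  block A is evicted from shared buffers; its entry in the
     *             last-written-LSN cache is set to at least endLSN
     *   backend:  block A is read again; neon_get_request_lsns() takes the
     *             request LSN from the lwlsn cache, i.e. >= endLSN
     *   startup:  still inside rm_redo(R) (block B not applied yet), so
     *             lastReplayedEndRecPtr < endLSN
     *   backend:  XLogWaitForReplayOf() therefore has to wait, even though
     *             the page it wants already reflects record R
     */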

But do we actually need to wait for lastReplayedEndRecPtr? We need to wait until the WAL is received, not replayed, and the lwlsn can never be assigned an LSN larger than what we have received. So it seems safe to remove this XLogWaitForReplayOf call altogether. @MMeent, what do you think?