Race condition between neon_read() and StartupXLOG()/redo loop leads to deadlock #9955

Open alexanderlaw opened 1 day ago

alexanderlaw commented 1 day ago

The test_replica_query_race test sometimes fails with a statement timeout, e.g.: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9863/11991160666/index.html#/testresult/abae5a0a14db6afe https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9861/11978735655/index.html#/testresult/8dac1a69385b2308

asyncpg.exceptions.QueryCanceledError: canceling statement due to statement timeout

This issue can be reproduced just by running the test in a loop (running 10 instances in parallel speeds up the reproduction for me). I compiled the binaries with CFLAGS="-O0 -DWAL_DEBUG", added

config_lines = ["wal_debug = on", "log_min_messages = DEBUG3", "log_min_error_statement = log", "log_connections = on", "log_disconnections = on", "log_line_prefix = '%m [%p:%l] %q%a '", "log_statement = 'all'"]

to the test endpoints' configuration, and ran:

DEFAULT_PG_VERSION=14 BUILD_TYPE=release USE_LFC=false poetry run pytest test_runner/regress/test_hot_standby.py::test_replica_query_race[release-pg14] 

With the statement timeout and the test timeout increased, I can see the following:

ITERATION 142
Sun Dec  1 08:55:00 UTC 2024
8       ======================== 1 passed, 1 warning in 18.54s =========================
10      ======================== 1 passed, 1 warning in 18.53s =========================
4       ======================== 1 passed, 1 warning in 19.04s =========================
3       ======================== 1 passed, 1 warning in 19.04s =========================
6       ======================== 1 passed, 1 warning in 19.04s =========================
1       ======================== 1 passed, 1 warning in 19.54s =========================
7       ======================== 1 passed, 1 warning in 19.54s =========================
5       ======================== 1 passed, 1 warning in 19.54s =========================
9       ======================== 1 passed, 1 warning in 20.04s =========================
2       ======================== 1 passed, 1 warning in 20.04s =========================
ITERATION 143
Sun Dec  1 08:55:21 UTC 2024
... hanging ...

test_output/test_replica_query_race_7[release-pg14]/repo/endpoints/standby/compute.log contains (with my comments added):

PG:2024-12-01 08:55:41.997 GMT [3694308:18687] LOG:  REDO @ 0/1810AB8; LSN 0/1810AE8: prev 0/1810A68; xid 9869; len 20 - Transaction/COMMIT: 2024-12-01 08:55:41.948542+00
PG:2024-12-01 08:55:41.997 GMT [3694308:18688] LOG:  REDO @ 0/1810AE8; LSN 0/1810CB8: prev 0/1810AB8; xid 0; len 8; blkref #0: rel 1663/13014/16397, blk 0 - Heap2/PRUNE: latestRemovedXid 9869 nredirected 1 ndead 0
PG:2024-12-01 08:55:41.997 GMT [3694314:28601] DEBUG:  sending write 0/18120D0 flush 0/18120D0 apply 0/1810AE8
PG:2024-12-01 08:55:41.997 GMT [3694308:18689] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/1810AE8, not_modified_since 0/16E33E8
PG:2024-12-01 08:55:41.997 GMT [3694335:6609] [unknown] LOG:  statement: SELECT clear_buffer_cache()
PG:2024-12-01 08:55:41.997 GMT [3694335:6610] [unknown] DEBUG:  updated min recovery point to 0/1810CB8 on timeline 1
PG:2024-12-01 08:55:41.997 GMT [3694335:6611] [unknown] CONTEXT:  writing block 0 of relation base/13014/16397
PG:2024-12-01 08:55:41.998 GMT [3694335:6612] [unknown] DEBUG:  bind <unnamed> to __asyncpg_stmt_1__
PG:2024-12-01 08:55:41.998 GMT [3694335:6613] [unknown] LOG:  execute __asyncpg_stmt_1__: SELECT counter FROM test
    ### standby_workload from the test executes a query
PG:2024-12-01 08:55:41.998 GMT [3694335:6614] [unknown] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/1810CB8, not_modified_since 0/1810CB8
    ### the backend executing the query wants to get the page it needs at LSN 0/1810CB8
PG:2024-12-01 08:55:41.998 GMT [3694308:18690] LOG:  REDO @ 0/1810CB8; LSN 0/1810D08: prev 0/1810AE8; xid 9870; len 18; blkref #0: rel 1663/13014/16397, blk 0 - Heap/HOT_UPDATE: off 205 xmax 9870 flags 0x10 ; new off 2 xmax 0
    ### this LSN is replayed just after that, inside StartupXLOG(); see the corresponding code below
PG:2024-12-01 08:55:41.998 GMT [3694314:28602] DEBUG:  sendtime 2024-12-01 08:55:41.998248+00 receipttime 2024-12-01 08:55:41.998347+00 replication apply delay 61 ms transfer latency 1 ms
PG:2024-12-01 08:55:41.998 GMT [3694314:28603] DEBUG:  sending write 0/1812150 flush 0/18120D0 apply 0/1810CB8
PG:2024-12-01 08:55:41.998 GMT [3694314:28604] DEBUG:  sending write 0/1812150 flush 0/1812150 apply 0/1810CB8
    ### now these two processes are waiting for each other
...
PG:2024-12-01 08:55:52.008 GMT [3694335:6615] [unknown] LOG:  Waiting for recovery to catch up to 0/1810CB8
PG:2024-12-01 08:55:52.008 GMT [3694335:6616] [unknown] STATEMENT:  SELECT counter FROM test
    ### the backend executing the query logs this message once every 10 seconds, inside XLogWaitForReplayOf()
...
PG:2024-12-01 08:56:02.009 GMT [3694335:6617] [unknown] LOG:  Waiting for recovery to catch up to 0/1810CB8
PG:2024-12-01 08:56:02.009 GMT [3694335:6618] [unknown] STATEMENT:  SELECT counter FROM test
...

Running processes of the standby instance:

law      3694300 3694266  0 08:55 pts/3    00:00:00 .../pg_install/v14/bin/postgres -D .../test_output/test_replica_query_race_7[release-pg14]/repo/endpoints/standby/pgdata
law      3694308 3694300  0 08:55 ?        00:00:00 postgres: startup recovering 000000010000000000000001
law      3694311 3694300  0 08:55 ?        00:00:00 postgres: checkpointer 
law      3694312 3694300  0 08:55 ?        00:00:00 postgres: background writer 
law      3694313 3694300  0 08:55 ?        00:00:00 postgres: stats collector 
law      3694314 3694300  0 08:55 ?        00:00:01 postgres: walreceiver streaming 0/1854F28
law      3694315 3694300  0 08:55 ?        00:00:03 postgres: cloud_admin postgres 127.0.0.1(44306) idle
law      3694335 3694300  0 08:55 ?        00:00:00 postgres: cloud_admin postgres 127.0.0.1(44360) SELECT
law      3694336 3694081  0 08:55 ?        00:00:01 postgres: cloud_admin postgres 127.0.0.1(51892) idle
law      3694785 3691682  0 09:04 ?        00:00:00 postgres: law storage_controller ::1(36894) idle

gdb sessions show:

Attaching to process 3694335

(gdb) bt
#0  0x00007a8e0672a007 in epoll_wait (epfd=10, events=0x6350740e8738, maxevents=1, timeout=10000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00006350721c5012 in WaitEventSetWaitBlock (set=0x6350740e86d8, cur_timeout=10000, occurred_events=0x7fff293671f0, nevents=1) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1477
#2  0x00006350721c4eb0 in WaitEventSetWait (set=0x6350740e86d8, timeout=10000, occurred_events=0x7fff293671f0, nevents=1, wait_event_info=83886088)
    at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1423
#3  0x00006350721c42b4 in WaitLatch (latch=0x7a8e04f5cc84, wakeEvents=41, timeout=10000, wait_event_info=83886088) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:494
#4  0x00006350721d4bba in ConditionVariableTimedSleep (cv=0x7a8e041045ec, timeout=10000, wait_event_info=83886088) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:163
#5  0x0000635071df12d0 in XLogWaitForReplayOf (redoEndRecPtr=25234616) at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:5392
#6  0x00007a8e06f67486 in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=0, request_lsns=0x7fff293673e0, buffers=0x7fff293673b0, nblocks=1, mask=0x0) at .../pgxn/neon/pagestore_smgr.c:2832
#7  0x00007a8e06f678bf in neon_read_at_lsn (rinfo=..., forkNum=MAIN_FORKNUM, blkno=0, request_lsns=..., buffer=0x7a8e04b20d80 "") at .../pgxn/neon/pagestore_smgr.c:2950
#8  0x00007a8e06f67aed in neon_read (reln=0x6350741f2d98, forkNum=MAIN_FORKNUM, blkno=0, buffer=0x7a8e04b20d80 "") at .../pgxn/neon/pagestore_smgr.c:2992
#9  0x00006350721f1058 in smgrread (reln=0x6350741f2d98, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7a8e04b20d80 "") at .../vendor/postgres-v14/src/backend/storage/smgr/smgr.c:515
#10 0x00006350721ade78 in ReadBuffer_common (smgr=0x6350741f2d98, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff293675d3)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1023
#11 0x00006350721ad568 in ReadBufferExtended (reln=0x7a8e040d5948, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:776
#12 0x0000635071d67c8c in heapgetpage (sscan=0x63507420c760, page=0) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:533
#13 0x0000635071d68f37 in heapgettup_pagemode (scan=0x63507420c760, dir=ForwardScanDirection, nkeys=0, key=0x0) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:1032
#14 0x0000635071d6a090 in heap_getnextslot (sscan=0x63507420c760, direction=ForwardScanDirection, slot=0x63507420c160) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:1529
#15 0x0000635071ffc3a8 in table_scan_getnextslot (sscan=0x63507420c760, direction=ForwardScanDirection, slot=0x63507420c160) at .../vendor/postgres-v14/src/include/access/tableam.h:1044
#16 0x0000635071ffc474 in SeqNext (node=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/nodeSeqscan.c:80
#17 0x0000635071fc1591 in ExecScanFetch (node=0x63507420bf70, accessMtd=0x635071ffc3dc <SeqNext>, recheckMtd=0x635071ffc485 <SeqRecheck>) at .../vendor/postgres-v14/src/backend/executor/execScan.c:132
#18 0x0000635071fc160a in ExecScan (node=0x63507420bf70, accessMtd=0x635071ffc3dc <SeqNext>, recheckMtd=0x635071ffc485 <SeqRecheck>) at .../vendor/postgres-v14/src/backend/executor/execScan.c:181
#19 0x0000635071ffc4d1 in ExecSeqScan (pstate=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/nodeSeqscan.c:112
#20 0x0000635071fbd6f7 in ExecProcNodeFirst (node=0x63507420bf70) at .../vendor/postgres-v14/src/backend/executor/execProcnode.c:464
#21 0x0000635071fb2326 in ExecProcNode (node=0x63507420bf70) at .../vendor/postgres-v14/src/include/executor/executor.h:260
#22 0x0000635071fb498e in ExecutePlan (estate=0x63507420bd48, planstate=0x63507420bf70, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=1, direction=ForwardScanDirection, dest=0x63507412e3b8, 
    execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:1560
#23 0x0000635071fb28b6 in standard_ExecutorRun (queryDesc=0x635074216e28, direction=ForwardScanDirection, count=1, execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:364
#24 0x0000635071fb273d in ExecutorRun (queryDesc=0x635074216e28, direction=ForwardScanDirection, count=1, execute_once=false) at .../vendor/postgres-v14/src/backend/executor/execMain.c:308
#25 0x00006350721fb9dc in PortalRunSelect (portal=0x6350741bf7a8, forward=true, count=1, dest=0x63507412e3b8) at .../vendor/postgres-v14/src/backend/tcop/pquery.c:921
#26 0x00006350721fb644 in PortalRun (portal=0x6350741bf7a8, count=1, isTopLevel=true, run_once=false, dest=0x63507412e3b8, altdest=0x63507412e3b8, qc=0x7fff29367c70)
    at .../vendor/postgres-v14/src/backend/tcop/pquery.c:765
#27 0x00006350721f6391 in exec_execute_message (portal_name=0x63507412df48 "", max_rows=1) at .../vendor/postgres-v14/src/backend/tcop/postgres.c:2217
#28 0x00006350721f991c in PostgresMain (argc=1, argv=0x7fff29367ec0, dbname=0x635074189990 "postgres", username=0x635074189970 "cloud_admin") at .../vendor/postgres-v14/src/backend/tcop/postgres.c:4672
#29 0x00006350721367e9 in BackendRun (port=0x635074181540) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:4541
#30 0x0000635072136044 in BackendStartup (port=0x635074181540) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:4263
#31 0x0000635072132068 in ServerLoop () at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1748
#32 0x00006350721317d2 in PostmasterMain (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1420
#33 0x00006350720336ec in main (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/main/main.c:249

(gdb) f 5
#5  0x0000635071df12d0 in XLogWaitForReplayOf (redoEndRecPtr=25234616) at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:5392
5392                    timeout = ConditionVariableTimedSleep(&XLogCtl->replayProgressCV,

(gdb) p/x redoEndRecPtr
$1 = 0x1810cb8

(gdb) f 10
#10 0x00006350721ade78 in ReadBuffer_common (smgr=0x6350741f2d98, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff293675d3)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1023
1023                            smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
(gdb) p *bufHdr
$2 = {tag = {rnode = {spcNode = 1663, dbNode = 13014, relNode = 16397}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 62, state = {value = 2248671234}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 57, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
Attaching to process 3694308

(gdb) bt
#0  0x00007a8e0672a007 in epoll_wait (epfd=9, events=0x6350740e8738, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00006350721c5012 in WaitEventSetWaitBlock (set=0x6350740e86d8, cur_timeout=-1, occurred_events=0x7fff29364de0, nevents=1) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1477
#2  0x00006350721c4eb0 in WaitEventSetWait (set=0x6350740e86d8, timeout=-1, occurred_events=0x7fff29364de0, nevents=1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:1423
#3  0x00006350721c42b4 in WaitLatch (latch=0x7a8e04f61f04, wakeEvents=33, timeout=-1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/ipc/latch.c:494
#4  0x00006350721d4bba in ConditionVariableTimedSleep (cv=0x7a8e04ba5160, timeout=-1, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:163
#5  0x00006350721d4b19 in ConditionVariableSleep (cv=0x7a8e04ba5160, wait_event_info=134217734) at .../vendor/postgres-v14/src/backend/storage/lmgr/condition_variable.c:100
#6  0x00006350721b2e4c in WaitIO (buf=0x7a8e04aa3d00) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4514
#7  0x00006350721b2eaf in StartBufferIO (buf=0x7a8e04aa3d00, forInput=true) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4551
#8  0x00006350721ae3a7 in BufferAlloc (smgr=0x6350741c87e8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, strategy=0x0, foundPtr=0x7fff29364ff1)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1173
#9  0x00006350721ad8aa in ReadBuffer_common (smgr=0x6350741c87e8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff2936512f)
    at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:872
#10 0x00006350721ad639 in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:804
#11 0x0000635071e094be in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=0, mode=RBM_NORMAL) at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:483
#12 0x0000635071e09322 in XLogReadBufferForRedoExtended (record=0x63507418e228, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7fff2936526c)
    at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:410
#13 0x0000635071e08f30 in XLogReadBufferForRedo (record=0x63507418e228, block_id=0 '\000', buf=0x7fff2936526c) at .../vendor/postgres-v14/src/backend/access/transam/xlogutils.c:297
#14 0x0000635071d77f97 in heap_xlog_update (record=0x63507418e228, hot_update=true) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:9832
#15 0x0000635071d796ca in heap_redo (record=0x63507418e228) at .../vendor/postgres-v14/src/backend/access/heap/heapam.c:10271
#16 0x0000635071df6063 in StartupXLOG () at .../vendor/postgres-v14/src/backend/access/transam/xlog.c:7830
#17 0x00006350721380d3 in StartupProcessMain () at .../vendor/postgres-v14/src/backend/postmaster/startup.c:259
#18 0x0000635071e0e582 in AuxiliaryProcessMain (argc=2, argv=0x7fff29367ff0) at .../vendor/postgres-v14/src/backend/bootstrap/bootstrap.c:458
#19 0x0000635072136cfb in StartChildProcess (type=StartupProcess) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:5485
#20 0x00006350721317ae in PostmasterMain (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/postmaster/postmaster.c:1412
#21 0x00006350720336ec in main (argc=3, argv=0x6350740e76d0) at .../vendor/postgres-v14/src/backend/main/main.c:249

(gdb) f 7
#7  0x00006350721b2eaf in StartBufferIO (buf=0x7a8e04aa3d00, forInput=true) at .../vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:4551
4551                    WaitIO(buf);
(gdb) p *buf
$1 = {tag = {rnode = {spcNode = 1663, dbNode = 13014, relNode = 16397}, forkNum = MAIN_FORKNUM, blockNum = 0}, buf_id = 62, state = {value = 2248671234}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 57, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}

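Both dumps show the same buffer (rel 1663/13014/16397, block 0, buf_id 62) with state = 2248671234 = 0x86080002. Assuming the state-word layout matches stock PostgreSQL 14's buf_internals.h, that decodes to refcount 2, usage count 2, and flags BM_TAG_VALID | BM_IO_IN_PROGRESS | BM_PERMANENT (BM_VALID is not set), i.e. both processes have the buffer pinned and a read into it is still marked as in progress. A tiny standalone decoder (flag values copied from PostgreSQL 14 and assumed unchanged in the Neon fork):

    /*
     * Decode the BufferDesc state word from the gdb dumps above.  Flag values
     * are copied from PostgreSQL 14's src/include/storage/buf_internals.h.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define BUF_REFCOUNT_MASK     ((1U << 18) - 1)
    #define BUF_USAGECOUNT_MASK   0x003C0000U
    #define BUF_USAGECOUNT_SHIFT  18
    #define BM_DIRTY              (1U << 23)
    #define BM_VALID              (1U << 24)
    #define BM_TAG_VALID          (1U << 25)
    #define BM_IO_IN_PROGRESS     (1U << 26)
    #define BM_PERMANENT          (1U << 31)

    int
    main(void)
    {
        uint32_t state = 2248671234U;   /* 0x86080002, from both gdb sessions */

        printf("refcount=%u usage_count=%u\n",
               state & BUF_REFCOUNT_MASK,
               (state & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT);
        if (state & BM_VALID)          printf("BM_VALID\n");
        if (state & BM_TAG_VALID)      printf("BM_TAG_VALID\n");
        if (state & BM_DIRTY)          printf("BM_DIRTY\n");
        if (state & BM_IO_IN_PROGRESS) printf("BM_IO_IN_PROGRESS\n");
        if (state & BM_PERMANENT)      printf("BM_PERMANENT\n");
        return 0;
    }
    /* Prints: refcount=2 usage_count=2, BM_TAG_VALID, BM_IO_IN_PROGRESS, BM_PERMANENT */
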
That is, the client backend called ReadBuffer_common, where we have the following:

        /*
         * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
         * not currently in memory.
         */
        bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                             strategy, &found);

So it set IO_IN_PROGRESS for buffer 16397/0, then called smgrread() -> neon_read() and chose to wait for LSN 0/1810CB8 (sleeping on replayProgressCV), which had not been replayed at that moment.
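
Frame #5 of the first backtrace is the wait loop of XLogWaitForReplayOf(), a function added in the Neon fork of xlog.c. A paraphrased sketch of that loop, reconstructed from the backtrace, the gdb listing at line 5392, and the 10-second "Waiting for recovery to catch up" messages (the real code may differ in details):

    void
    XLogWaitForReplayOf(XLogRecPtr redoEndRecPtr)
    {
        ConditionVariablePrepareToSleep(&XLogCtl->replayProgressCV);
        /* Wait until the startup process has replayed up to redoEndRecPtr. */
        while (GetXLogReplayRecPtr(NULL) < redoEndRecPtr)
        {
            /*
             * Sleep for at most 10 s; the startup process broadcasts
             * replayProgressCV after applying each WAL record (see the redo
             * loop quoted below).
             */
            if (ConditionVariableTimedSleep(&XLogCtl->replayProgressCV,
                                            10000 /* ms */,
                                            WAIT_EVENT_RECOVERY_WAL_STREAM))
                elog(LOG, "Waiting for recovery to catch up to %X/%X",
                     LSN_FORMAT_ARGS(redoEndRecPtr));
        }
        ConditionVariableCancelSleep();
    }

The only thing that advances GetXLogReplayRecPtr() and broadcasts replayProgressCV is the startup process's redo loop, which is exactly what gets blocked next.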

Then the recovery process reads the needed record (REDO @ 0/1810CB8), logs a message about it, and calls heap_redo(), which tries to apply changes to the same buffer, which is marked as IO_IN_PROGRESS by the client backend. The corresponding StartupXLOG() code:

            /*
             * main redo apply loop
             */
            do
            {
...
                {
...
                    appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
                                     LSN_FORMAT_ARGS(ReadRecPtr),
                                     LSN_FORMAT_ARGS(EndRecPtr));
...
                }
...
                /* Now apply the WAL record itself */
                RmgrTable[record->xl_rmid].rm_redo(xlogreader);
...
                ConditionVariableBroadcast(&XLogCtl->replayProgressCV);
...

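Why the redo loop cannot finish applying that record: XLogReadBufferForRedo() goes through BufferAlloc(), whose StartBufferIO() finds BM_IO_IN_PROGRESS already set by the client backend and therefore calls WaitIO() (frames #6-#8 of the second backtrace). Roughly as it appears in PostgreSQL 14's bufmgr.c, that wait only ends once the flag is cleared:

    static void
    WaitIO(BufferDesc *buf)
    {
        ConditionVariable *cv = BufferDescriptorGetIOCV(buf);

        ConditionVariablePrepareToSleep(cv);
        for (;;)
        {
            uint32      buf_state;

            /* Re-check the flag under the buffer header spinlock. */
            buf_state = LockBufHdr(buf);
            UnlockBufHdr(buf, buf_state);

            if (!(buf_state & BM_IO_IN_PROGRESS))
                break;
            ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
        }
        ConditionVariableCancelSleep();
    }

BM_IO_IN_PROGRESS is only cleared (in TerminateBufferIO()) once the client backend's smgrread() returns, and that read is waiting for replay progress that only this startup process can make, so neither process can ever proceed.
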
Thus, test_replica_query_race is flaky because of yet another race condition.

knizhnik commented 1 day ago

XLogWaitForReplayOf was added by @MMeent for replica support from the very beginning. The motivation is obvious: a replica should not request pages more recent than what it has applied at this moment. The question is how a replica can ever request a page with an LSN larger than GetXLogReplayRecPtr(). The request LSN is taken from the last-written-LSN cache, so how can we end up with an LSN larger than the replay LSN? It looks like it should not be possible, but the stack traces above show that it actually happens. lastReplayedEndRecPtr is advanced after a WAL record has been applied, and a WAL record can update multiple pages. If one of those pages is evicted from shared buffers, its lwlsn will be larger than lastReplayedEndRecPtr. If that page is then requested again, we get this situation where the request LSN is larger than the apply LSN.
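
A possible interleaving matching that description (purely illustrative; the record, blocks, and ordering are hypothetical, not taken from the logs above):

    /*
     * WAL record R spans [startLSN, endLSN) and touches blocks A and B.
     *
     *   startup:  rm_redo(R) modifies block A; A's page LSN becomes endLSN
     *   backend:  block A is evicted from shared buffers; its entry in the
     *             last-written-LSN cache is set to at least endLSN
     *   backend:  block A is read again; neon_get_request_lsns() takes the
     *             request LSN from the lwlsn cache, i.e. >= endLSN
     *   startup:  still inside rm_redo(R) (block B not applied yet), so
     *             lastReplayedEndRecPtr < endLSN
     *   backend:  XLogWaitForReplayOf() therefore has to wait, even though
     *             the page it wants already reflects record R
     */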

But do we actually need to wait for lastReplayedEndRecPtr? We need to wait until the WAL is received, not replayed, and the lwlsn can never be assigned an LSN larger than what we have received. So it seems safe to remove this XLogWaitForReplayOf call altogether. @MMeent, what do you think?