Open alexanderlaw opened 1 day ago
`XLogWaitForReplayOf` was added for replica support from the very beginning by @MMeent.
The motivation is obvious: a replica should not request pages more recent than what it has applied at this moment.
The question is how a replica can ever request a page with an LSN larger than `GetXLogReplayRecPtr()`.
The request LSN is taken from the last-written-LSN cache. How can we get an LSN there that is larger than the replay LSN?
It looks impossible, but the stack traces above show that it actually happens.
`lastReplayedEndRecPtr` is advanced after a WAL record is applied, and a WAL record can update multiple pages.
If one of these pages is evicted from shared buffers, its lwlsn will be larger than `lastReplayedEndRecPtr`. If the page is then requested again, we get the situation where the request LSN is larger than the apply LSN.
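The window described above can be sketched with a toy model. This is not Neon's code: the class, its method names, and the assumption that the cache entry is set to the record's end LSN before the replay pointer moves are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStandby:
    """Toy model of a standby; all names here are hypothetical."""
    replayed_end_lsn: int = 0                   # stands in for lastReplayedEndRecPtr
    lwlsn: dict = field(default_factory=dict)   # page -> last-written LSN

    def begin_redo(self, end_lsn, evicted_pages):
        # Assumption: pages touched by the record and evicted during redo
        # get the record's LSN in the cache before replay is marked done.
        for page in evicted_pages:
            self.lwlsn[page] = end_lsn

    def finish_redo(self, end_lsn):
        # The replay pointer only advances once the whole record is applied.
        self.replayed_end_lsn = end_lsn

    def request_lsn(self, page):
        # Request LSN comes from the last-written-LSN cache.
        return self.lwlsn.get(page, self.replayed_end_lsn)

    def must_wait(self, page):
        return self.request_lsn(page) > self.replayed_end_lsn


standby = ToyStandby(replayed_end_lsn=100)
standby.begin_redo(end_lsn=120, evicted_pages=["A", "B"])
in_window = standby.must_wait("A")      # request LSN 120 vs. replayed 100
standby.finish_redo(120)
after_redo = standby.must_wait("A")     # request LSN 120 vs. replayed 120
```

In this model a backend that reads page A between `begin_redo` and `finish_redo` sees a request LSN ahead of the replay LSN, which is the case in which the wait is taken.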
But do we actually need to wait for `lastReplayedEndRecPtr`? We need to wait until the WAL is received, not replayed. And lwlsn can never be assigned an LSN larger than what we have received. So it seems safe to remove `XLogWaitForReplayOf` altogether. @MMeent what do you think?
The test_replica_query_race test sometimes fails with a statement timeout, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9863/11991160666/index.html#/testresult/abae5a0a14db6afe
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9861/11978735655/index.html#/testresult/8dac1a69385b2308
This issue can be reproduced just by running the test in a loop (running 10 instances in parallel speeds up the reproduction for me). I compiled binaries with CFLAGS="-O0 -DWAL_DEBUG", added
to the test endpoints' configuration, and ran:
With the statement timeout and the test timeout increased, I can see the following:
test_output/test_replica_query_race_7[release-pg14]/repo/endpoints/standby/compute.log contains (with my comments added):
Running processes of the standby instance:
gdb sessions show:
That is, the client backend called ReadBuffer_common, where we have the following:
so, it set IO_IN_PROGRESS for the buffer 16397/0, then called smgrread() -> neon_read() and chose to wait for LSN 0/1810CB8 (sleeping on replayProgressCV), which had not been replayed at that moment.
Then the recovery process reads the needed record (REDO @ 0/1810CB8), logs a message about it, and then calls heap_redo(), which tries to apply changes to the same buffer, which is marked as IO_IN_PROGRESS by the client backend. The corresponding StartupXLOG() code:
Thus, test_replica_query_race is flaky because of yet another race condition.
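The race can be summarized as a two-party wait cycle. The sketch below is only an illustration of that reading, assuming the startup process blocks until the in-progress I/O on the buffer completes. The `find_cycle` helper and the participant labels are hypothetical and not part of PostgreSQL or Neon.

```python
def find_cycle(waits_for):
    """Return the participants of a wait cycle, or None if there is none.

    waits_for maps each participant to the one it is blocked on.
    """
    for start in waits_for:
        seen = [start]
        node = waits_for.get(start)
        # Follow the chain of waits until it ends or revisits a participant.
        while node is not None and node not in seen:
            seen.append(node)
            node = waits_for.get(node)
        if node == start:
            return seen
    return None


waits_for = {
    # Backend holds IO_IN_PROGRESS on 16397/0 and sleeps until
    # LSN 0/1810CB8 is replayed.
    "client backend": "startup process",
    # Assumption: redo of 0/1810CB8 needs the same buffer and blocks
    # until the backend's I/O on it finishes.
    "startup process": "client backend",
}
cycle = find_cycle(waits_for)
```

Under that assumption neither side can make progress, which would be consistent with the statement timeout seen in the test.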