Open jcsp opened 2 months ago
Looking at the code/test, it's not obvious what guarantees that the standby_horizon feedback should reach the pageserver before the GC is run:
- Nothing obviously ensures that a SafekeeperTimelineInfo will arrive at the pageserver between the standby postgres starting and the test calling into GC.
- record_safekeeper_info, although this test runs with just a single safekeeper anyway.

I can make this test fail reliably by running multiple safekeepers and stopping one of them partway through the test, so it seems likely that there is something unsound about how we rely on standby_horizon updates to propagate, which might come up (rarely) even with a single safekeeper.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8591/10488910985/index.html#suites/950eff205b552e248417890b8b8f189e/c8bc7f8c49c04b80
This test is specifically exercising that the pageserver doesn't GC the data that a lagging replica needs, so this failure mode looks like a real bug.
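
For the test-side ordering problem only, one option might be to poll until the pageserver has actually observed a standby_horizon before triggering GC. This is a minimal sketch, not the existing test code: `get_standby_horizon` is a hypothetical callable standing in for however the test would read the value from the pageserver, and treating zero as "not yet received" is an assumption. It would not address the multi-safekeeper failure described above.

```python
import time
from typing import Callable


def wait_for_standby_horizon(
    get_standby_horizon: Callable[[], int],  # hypothetical accessor, not a real API
    timeout: float = 30.0,
    interval: float = 0.5,
) -> int:
    """Poll until the getter reports a non-zero standby_horizon, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        horizon = get_standby_horizon()
        # Assumption: zero means no feedback has reached the pageserver yet.
        if horizon != 0:
            return horizon
        if time.monotonic() >= deadline:
            raise TimeoutError("standby_horizon did not reach the pageserver in time")
        time.sleep(interval)
```

The test would call this after starting the standby and before calling into GC, so that a missing update shows up as a timeout rather than as GC removing data the replica still needs.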