neondatabase / neon

Failure in `test_hot_standby_gc` #8801

Open jcsp opened 2 months ago

jcsp commented 2 months ago

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8591/10488910985/index.html#suites/950eff205b552e248417890b8b8f189e/c8bc7f8c49c04b80

psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 26922 failed: FATAL:  [NEON_SMGR] [shard 0] could not read block 0 in rel 1664/0/2676.0 from page server at lsn 0/016A6020
DETAIL:  page server returned error: Bad request: tried to request a page version that was garbage collected. requested at 0/16A6020 gc cutoff 0/45EAF70

This test specifically checks that the pageserver doesn't GC data that a lagging replica still needs, so this failure mode looks like a real bug.
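
For reference, the two LSNs in the error can be compared numerically. This is a minimal sketch, assuming the standard Postgres `hi/lo` hexadecimal LSN notation; `parse_lsn` is a hypothetical helper written for illustration, not a function from the Neon test suite.

```python
def parse_lsn(text: str) -> int:
    """Convert a Postgres-style LSN such as '0/16A6020' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


requested = parse_lsn("0/16A6020")  # LSN the replica asked for
gc_cutoff = parse_lsn("0/45EAF70")  # cutoff reported by the pageserver

# The request is rejected because it falls below the GC cutoff.
request_is_below_cutoff = requested < gc_cutoff
```

The comparison confirms that the replica requested a page version older than the cutoff, which is the situation the test expects the pageserver to prevent.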

jcsp commented 2 months ago

Looking at the code/test, it's not obvious what guarantees that the standby_horizon feedback reaches the pageserver before GC runs.

I can make this test fail reliably by running multiple safekeepers and stopping one of them partway through the test, so it seems likely that there is something unsound about how we're relying on standby_horizon updates to propagate, which might come up (rarely) even with a single safekeeper.
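
The suspected ordering problem can be illustrated with a toy model. This is a sketch under stated assumptions, not pageserver code: `ToyPageserver`, `update_standby_horizon`, and `run_gc` are hypothetical names, and the rule that GC clamps its cutoff to the last known standby_horizon is an assumption about the intended behavior.

```python
from typing import Optional


class ToyPageserver:
    """Hypothetical model: GC must not advance past the last known standby horizon."""

    def __init__(self) -> None:
        self.standby_horizon: Optional[int] = None  # None means no feedback received yet
        self.gc_cutoff = 0

    def update_standby_horizon(self, lsn: int) -> None:
        self.standby_horizon = lsn

    def run_gc(self, planned_cutoff: int) -> int:
        # Assumed rule: clamp the cutoff to the standby horizon when one is known.
        if self.standby_horizon is not None:
            planned_cutoff = min(planned_cutoff, self.standby_horizon)
        self.gc_cutoff = max(self.gc_cutoff, planned_cutoff)
        return self.gc_cutoff


replica_lsn = 0x16A6020  # LSN the lagging replica reads at (from the error above)
planned = 0x45EAF70      # cutoff GC would pick without any standby feedback

# Feedback arrives before GC: the replica's LSN stays readable.
ordered = ToyPageserver()
ordered.update_standby_horizon(replica_lsn)
safe_cutoff = ordered.run_gc(planned)

# Feedback is delayed: GC runs unclamped and passes the replica's LSN.
racy = ToyPageserver()
racy_cutoff = racy.run_gc(planned)
racy.update_standby_horizon(replica_lsn)  # arrives too late to help
```

In the second case nothing orders the feedback before GC, so the cutoff ends up above the replica's LSN, matching the error in the report.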