Open jcsp opened 1 day ago
The compute fails to restart due to a panic in the WAL proposer (from ep-2/compute.log):
PG:2024-10-02 19:33:38.546 GMT [392338] PANIC: [WP] collected propEpochStartLsn 0/2000000, but basebackup LSN 0/152CBC8
PG:2024-10-03 01:01:17.098 GMT [332945] PANIC: [WP] collected propEpochStartLsn 0/2002078, but basebackup LSN 0/15206D8
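To make the two values in each PANIC line easier to compare, here is a minimal sketch that parses them into integers. It relies only on the standard PostgreSQL textual LSN format (two hexadecimal numbers, high 32 bits and low 32 bits, separated by a slash). The helper names `parse_lsn` and `lsn_gap` are hypothetical and not part of the neon codebase.

```python
import re

# Matches the two LSNs in the walproposer PANIC message quoted above.
PANIC_RE = re.compile(
    r"collected propEpochStartLsn (?P<prop>[0-9A-Fa-f]+/[0-9A-Fa-f]+), "
    r"but basebackup LSN (?P<base>[0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/152CBC8' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap(line: str):
    """Return propEpochStartLsn minus basebackup LSN in bytes, or None if no match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return parse_lsn(m["prop"]) - parse_lsn(m["base"])


lines = [
    "PG:2024-10-02 19:33:38.546 GMT [392338] PANIC: [WP] collected "
    "propEpochStartLsn 0/2000000, but basebackup LSN 0/152CBC8",
    "PG:2024-10-03 01:01:17.098 GMT [332945] PANIC: [WP] collected "
    "propEpochStartLsn 0/2002078, but basebackup LSN 0/15206D8",
]

for line in lines:
    # A positive gap means the proposer's start LSN is ahead of the basebackup.
    print(lsn_gap(line))
```

In both log lines the gap is positive, i.e. propEpochStartLsn is ahead of the basebackup LSN.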
I'm not familiar with the background here; perhaps @arssher can shed some light on it.
https://github.com/neondatabase/neon/pull/9099 (merged 01.10 17:54 UTC) was supposed to fix this. I haven't exactly verified whether this PR is included in the runs above (with force pushes it's probably impossible), but let's see if failures repeat on fresh PRs.
Random notes from poking around:
propEpochStartLsn is taken from the safekeeper's flushLsn (the last durable LSN).
pageserver constructs and sends the basebackup in response to an RPC call. The caller provides the LSN.
0/152CBC8 LSN for the base backup.
Heading into a meeting, will pick this back up on Monday.
Thanks @arssher, #9099 makes sense. Let's look for any newer failures.
I checked all the failures on main after October 1st. None of them were caused by this problem; they were errors in test cleanup and the like (e.g. postmaster.pid missing on stop). I'll have a closer look at them next week.
This test is failing ~1.5% of the time.
It was added in https://github.com/neondatabase/neon/pull/8943
Examples:
Case 1
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6560/11150102484/index.html#testresult/32593da5b1e125d1/retries
Case 2
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9232/11153678585/index.html#/testresult/ace582de47d6f7a9