neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

safekeeper: Fix local_start_lsn on sk4 #6449

Closed petuhovskiy closed 5 months ago

petuhovskiy commented 9 months ago

We're seeing errors on safekeeper-4 in us-east-2, and they are localized only on this safekeeper.

2024-01-23T15:12:42.309924Z ERROR {cid=12138 ttid=XXX/YYY}:WAL sender: terminated: Other(Failed to open WAL segment download stream for remote path RemotePath("XXX/YYY/000000010000000000000003")
Caused by:
    No file found for the remote object id given
Stack backtrace:

safekeeper-4 is the only safekeeper that has timelines with local_start_lsn != timeline_start_lsn. This is quite possibly causing the issue: the error can happen if a client (the pageserver) requests WAL from the same segment in which local_start_lsn is located, but from a position before local_start_lsn itself. The safekeeper logic prevents reading uninitialized WAL, so the safekeeper tries to read that WAL from remote storage instead, but the read fails if the segment hasn't been uploaded yet.
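To make the suspected failure mode concrete, here is a minimal Python sketch of the decision described above. It is not the safekeeper's actual code: the names `choose_wal_source`, `remote_read_fails`, and `uploaded_segments` are hypothetical, and the 16 MiB segment size is the Postgres default, assumed here. The segment file name follows the standard Postgres naming scheme (timeline ID, then the segment number split into two 8-digit hex fields).

```python
from enum import Enum

# Assumption: default Postgres WAL segment size (16 MiB).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


class WalSource(Enum):
    LOCAL_DISK = "local_disk"
    REMOTE_STORAGE = "remote_storage"


def segment_number(lsn: int, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Return the number of the WAL segment that contains `lsn`."""
    return lsn // seg_size


def segment_file_name(tli: int, segno: int, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Standard Postgres WAL segment file name for a timeline ID and segment number."""
    segs_per_xlogid = 0x1_0000_0000 // seg_size
    return f"{tli:08X}{segno // segs_per_xlogid:08X}{segno % segs_per_xlogid:08X}"


def choose_wal_source(read_from_lsn: int, local_start_lsn: int) -> WalSource:
    """Hypothetical model: WAL before local_start_lsn is uninitialized on local
    disk, so a read starting there has to be served from remote storage."""
    if read_from_lsn >= local_start_lsn:
        return WalSource.LOCAL_DISK
    return WalSource.REMOTE_STORAGE


def remote_read_fails(read_from_lsn: int, uploaded_segments: set) -> bool:
    """Hypothetical model: the remote read fails if the segment that contains
    the requested LSN has not been uploaded yet."""
    return segment_number(read_from_lsn) not in uploaded_segments


# Scenario from this issue: local_start_lsn sits in the middle of segment 3,
# the pageserver asks for WAL earlier in that same segment, and segment 3
# has not been uploaded to remote storage.
local_start_lsn = 3 * WAL_SEGMENT_SIZE + 0x1000
requested_lsn = 3 * WAL_SEGMENT_SIZE + 0x100
uploaded_segments = {1, 2}

source = choose_wal_source(requested_lsn, local_start_lsn)
fails = source is WalSource.REMOTE_STORAGE and remote_read_fails(
    requested_lsn, uploaded_segments
)
print(source.value, segment_file_name(1, segment_number(requested_lsn)), fails)
```

In this model, a timeline whose local_start_lsn equals timeline_start_lsn never takes the remote path for WAL inside its first local segment, which would be consistent with the error showing up only on safekeeper-4.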

The plan is:

Related slack threads:

petuhovskiy commented 5 months ago

Fixed local_start_lsn on most of the timelines; details are in https://neondb.slack.com/archives/C04KGFVUWUQ/p1714052306923869