neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

pageserver: persist configuration invariants #8309

Open jcsp opened 2 months ago

jcsp commented 2 months ago

There are two configuration properties that should never be changed after first running a pageserver:

When deploying with a declarative configuration management system, it is relatively easy to make a mistake and modify one or other of these properties, resulting in "weird" situations that are hard to debug.

To improve the operational resilience of the system, we should store these invariants on the pageserver's local storage, and refuse to start up if the values in the pageserver config file disagree with the stored invariants, so that a configuration error becomes very obvious and explicit, rather than risking hard-to-diagnose behaviors.
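As a rough illustration of the check described above (not the pageserver's actual code), a startup routine could persist the invariants on first run and compare them on every later start. The `Invariants` struct, its field names, the `key=value` file layout, and `check_or_persist` are all assumptions made for this sketch; a real implementation would presumably also write the file atomically and fsync it.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical set of invariants; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Invariants {
    pub node_id: u64,
    pub control_plane_api: String,
}

#[derive(Debug)]
pub enum StartupError {
    Io(io::Error),
    /// The invariants file exists but could not be parsed.
    Corrupt(String),
    /// The config file disagrees with what was persisted on first start.
    Mismatch { stored: Invariants, configured: Invariants },
}

impl Invariants {
    /// Assumed on-disk layout: one `key=value` pair per line.
    fn encode(&self) -> String {
        format!(
            "node_id={}\ncontrol_plane_api={}\n",
            self.node_id, self.control_plane_api
        )
    }

    fn decode(text: &str) -> Result<Self, StartupError> {
        let mut node_id = None;
        let mut api = None;
        for line in text.lines() {
            // Split at the first '=' only, so URLs containing '=' survive.
            match line.split_once('=') {
                Some(("node_id", v)) => node_id = v.parse::<u64>().ok(),
                Some(("control_plane_api", v)) => api = Some(v.to_string()),
                _ => {}
            }
        }
        match (node_id, api) {
            (Some(node_id), Some(control_plane_api)) => Ok(Invariants {
                node_id,
                control_plane_api,
            }),
            _ => Err(StartupError::Corrupt(text.to_string())),
        }
    }
}

/// First start: write the invariants. Later starts: refuse on any difference.
pub fn check_or_persist(path: &Path, configured: &Invariants) -> Result<(), StartupError> {
    match fs::read_to_string(path) {
        Ok(text) => {
            let stored = Invariants::decode(&text)?;
            if &stored == configured {
                Ok(())
            } else {
                Err(StartupError::Mismatch {
                    stored,
                    configured: configured.clone(),
                })
            }
        }
        // No file yet: treat this as the first run and record the values.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            fs::write(path, configured.encode()).map_err(StartupError::Io)
        }
        Err(e) => Err(StartupError::Io(e)),
    }
}
```

With this shape, an intentional change means deleting or rewriting the invariants file by hand; editing the deployed config alone makes startup fail with `Mismatch`.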

jcsp commented 1 month ago

In https://github.com/neondatabase/neon/pull/7766 the id moves out of config and into the identity file. Since the identity.toml is written externally, we still need some file written by the pageserver itself that we can reasonably expect that deployment tools won't mess with.

problame commented 1 month ago

Just discovered this issue. Let me paraphrase the idea:

Instead of having an explicit "--init mode" (which Vlad and I have been hard at work killing, final PR in that series is https://github.com/neondatabase/neon/pull/7766), you are proposing that PS have an implicit init mode where

Correctly understood?

Some additional thoughts on control_plane_api:

jcsp commented 1 month ago

> I can totally foresee us needing to change the control_plane_api url.

Yes: this proposal doesn't prevent that, but it requires it to be done in a very explicit way: one can't just modify ansible config, one has to modify that and explicitly reset the invariants file.

> I think it would make more sense to have the notion of a control plane identity

I can see the merit of that, although it would bring its own problems: e.g. if we needed to recover the storage controller state from S3 in a disaster, we'd still have to manually tell all the pageservers to forget the old ID. The pageserver behaviour on a configuration snafu also becomes a bit less immediate (rather than terminating on startup if the URL is changed, we would get as far as talking to the "wrong" controller and then terminating when its ID doesn't match).
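To make the timing difference concrete, here is a small sketch of where each check would fire. Everything in it is hypothetical: the `Outcome` enum, both function names, and the `fetch_id` closure, which stands in for whatever request would ask the controller at the configured URL for its identity.

```rust
/// Hypothetical outcomes, distinguishing *when* a misconfiguration is caught.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Caught using local state only, before any network traffic.
    RefusedBeforeNetwork,
    /// Caught only after a controller has already been contacted.
    RefusedAfterHandshake,
    Started,
}

/// URL pinning: compare the configured URL against the locally stored one.
pub fn start_with_url_pinning(stored_url: &str, configured_url: &str) -> Outcome {
    if stored_url != configured_url {
        Outcome::RefusedBeforeNetwork
    } else {
        Outcome::Started
    }
}

/// Identity pinning: the URL may change freely, but the controller reached
/// through it must report the identity that was stored earlier.
pub fn start_with_identity_pinning<F>(stored_id: &str, fetch_id: F) -> Outcome
where
    F: FnOnce() -> String,
{
    // The request is made first; only then can the mismatch be detected.
    let reported = fetch_id();
    if reported != stored_id {
        Outcome::RefusedAfterHandshake
    } else {
        Outcome::Started
    }
}
```

In the disaster-recovery case mentioned above, a rebuilt controller would presumably report a new identity, so the stored ID on each pageserver would still need a manual reset under the second scheme.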