Open jcsp opened 2 months ago
In https://github.com/neondatabase/neon/pull/7766 the id moves out of config and into the identity file. Since identity.toml is written externally, we still need a file that is written by the pageserver itself and that we can reasonably expect deployment tools not to touch.
Just discovered this issue. Let me paraphrase the idea:
Instead of having an explicit "--init mode" (which Vlad and I have been hard at work killing; the final PR in that series is https://github.com/neondatabase/neon/pull/7766), you are proposing that PS have an implicit init mode where
Correctly understood?
Some additional thoughts on control_plane_api:
control_plane_api url. I can totally foresee us needing to change the control_plane_api url.
Yes: this proposal doesn't prevent that, but it requires it to be done in a very explicit way: one can't just modify ansible config, one has to modify that and explicitly reset the invariants file.
I think it would make more sense to have the notion of a control plane identity
I can see the merit of that, although it would bring its own problems: e.g. if we needed to recover the storage controller state from S3 in a disaster, we'd still have to manually tell all the pageservers to forget the old ID. The pageserver behaviour on a configuration snafu also becomes a bit less immediate (rather than terminating on startup if the URL is changed, we would get as far as talking to the "wrong" controller and then terminating when its ID doesn't match).
There are two configuration properties that should never be changed after first running a pageserver:
- control_plane_api: if this changes, the pageserver will register itself with some other control plane endpoint, and might end up with two control planes fighting to manage it.
- id: the node ID of a pageserver shouldn't change over its lifetime. This is legal in principle, but inefficient: changing ID would cause a pageserver to drop all its local content on /re_attach, and the controller will reject its registration attempts as long as some other node with the same hostname exists.

When deploying with a declarative configuration management system, it is relatively easy to make a mistake and modify one or other of these properties, resulting in "weird" situations that are hard to debug.
To improve the operational resilience of the system, we should store these invariants on the pageserver's local storage, and refuse to start up if the values in the pageserver config file disagree with the stored invariants, so that a configuration error becomes very obvious and explicit, rather than risking hard-to-diagnose behaviors.
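A minimal sketch of what that startup check could look like, using only the Rust standard library. Everything named here is hypothetical and for illustration only: the `Invariants` struct, the `check_or_init` function, the error type, and the simple `key=value` on-disk format are assumptions, not the pageserver's actual code or file format (a real implementation would more likely use TOML and write the file atomically). The idea is that the first startup records the configured values, and every later startup compares the config against what was recorded.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Values that should never change after a pageserver's first start.
/// (Hypothetical type; field names mirror the config keys discussed above.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Invariants {
    pub control_plane_api: String,
    pub id: u64,
}

#[derive(Debug)]
pub enum InvariantsError {
    Io(io::Error),
    Malformed(String),
    Mismatch {
        field: &'static str,
        stored: String,
        configured: String,
    },
}

impl Invariants {
    /// Serialize as `key=value` lines (an assumed format, chosen to avoid dependencies).
    fn serialize(&self) -> String {
        format!("control_plane_api={}\nid={}\n", self.control_plane_api, self.id)
    }

    fn parse(text: &str) -> Result<Self, InvariantsError> {
        let mut api = None;
        let mut id = None;
        for line in text.lines().filter(|l| !l.trim().is_empty()) {
            // Split at the first '=' only, so URLs containing '=' survive.
            let (key, value) = line
                .split_once('=')
                .ok_or_else(|| InvariantsError::Malformed(line.to_string()))?;
            match key.trim() {
                "control_plane_api" => api = Some(value.trim().to_string()),
                "id" => {
                    let parsed = value
                        .trim()
                        .parse::<u64>()
                        .map_err(|_| InvariantsError::Malformed(line.to_string()))?;
                    id = Some(parsed);
                }
                _ => return Err(InvariantsError::Malformed(line.to_string())),
            }
        }
        match (api, id) {
            (Some(control_plane_api), Some(id)) => Ok(Invariants { control_plane_api, id }),
            _ => Err(InvariantsError::Malformed("missing field".to_string())),
        }
    }
}

/// On first start (no file yet), record the configured values.
/// On later starts, return an error if the config disagrees with the stored values.
pub fn check_or_init(path: &Path, configured: &Invariants) -> Result<(), InvariantsError> {
    match fs::read_to_string(path) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            fs::write(path, configured.serialize()).map_err(InvariantsError::Io)
        }
        Err(e) => Err(InvariantsError::Io(e)),
        Ok(text) => {
            let stored = Invariants::parse(&text)?;
            if stored.control_plane_api != configured.control_plane_api {
                return Err(InvariantsError::Mismatch {
                    field: "control_plane_api",
                    stored: stored.control_plane_api,
                    configured: configured.control_plane_api.clone(),
                });
            }
            if stored.id != configured.id {
                return Err(InvariantsError::Mismatch {
                    field: "id",
                    stored: stored.id.to_string(),
                    configured: configured.id.to_string(),
                });
            }
            Ok(())
        }
    }
}
```

In this sketch, the caller would treat any `Mismatch` as fatal and exit before registering with a control plane. Deliberately changing one of the values would then require removing or rewriting the stored file as an explicit, separate step, rather than just editing the deployed config.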