storage_controller: nicer split-brain handling, and reduce 503 period during restarts

jcsp commented 4 months ago

Goal

Reduce or eliminate the period during controller restarts (typically a few seconds) where requests from the control plane may receive 503 responses.

Background

Here's how restarts work today:

We run in a k8s Deployment with strategy set to Recreate, which means it kills old pods before starting new ones.
Safety: The controller is safe against split-brain situations because generation updates always happen in isolated database transactions. If two nodes did run at the same time, though, things could get weird (controllers disagreeing about where to attach a tenant), and at worst this could cause availability issues for tenants that underwent changes while two controllers were alive.
At startup: we read from all pageservers' APIs to learn the latest tenant locations. This usually takes ~1s, but if pageservers are slow or unavailable it can take tens of seconds.
Once we've talked to all the pageservers, the controller becomes available.
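
To illustrate why startup time is gated on the slowest pageserver, here is a rough sketch. It assumes the calls are issued concurrently, which the description above does not confirm. `fetch_locations` and the pageserver names are hypothetical stand-ins, not the real pageserver API.

```python
import asyncio
import time

async def fetch_locations(name, latency):
    """Hypothetical stand-in for one pageserver API call; sleeps to simulate latency."""
    await asyncio.sleep(latency)
    return {name: f"locations-from-{name}"}

async def startup_scan(pageservers):
    """Query every pageserver concurrently; return (observed, elapsed_seconds)."""
    start = time.monotonic()
    results = await asyncio.gather(
        *(fetch_locations(name, latency) for name, latency in pageservers.items())
    )
    observed = {}
    for result in results:
        observed.update(result)
    # The controller only becomes available after the slowest call returns.
    return observed, time.monotonic() - start

if __name__ == "__main__":
    observed, elapsed = asyncio.run(startup_scan({"ps-1": 0.01, "ps-2": 0.05}))
    print(sorted(observed), f"{elapsed:.3f}s")
```

Even with full concurrency, one slow or unavailable pageserver holds up readiness for the whole controller.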

Ideas

Let's assume that we will run the k8s deployment with rolling restarts (i.e. start the replacement before killing the existing pod), and figure out how to make that work smoothly.

Figuring out which instance is leader:
- We may be able to improve split-brain behavior at the same time as making restarts smoother, by putting some state in the database that lets the running controller publish its own address and either some lease-like timestamp, or some actual long running database transaction that blocks other controllers from starting up.
- When a controller is running but is not the leader, it may use k8s readiness probe to avoid receiving requests. Once it has successfully claimed leadership in the database (e.g. by writing itself as the leader, perhaps after making an API call to its peer to ask it to step down), it may consider itself ready.
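
A minimal sketch of the lease-like timestamp variant, using SQLite purely for illustration. The `leader` table, its columns, and `try_claim_leadership` are hypothetical names, not the controller's actual schema. A real implementation would run against the controller's own database and would need to decide how clocks are compared.

```python
import sqlite3

# Hypothetical schema: a single-row table naming the current leader.
SCHEMA = """
CREATE TABLE IF NOT EXISTS leader (
    id INTEGER PRIMARY KEY CHECK (id = 1),  -- enforce a single row
    address TEXT NOT NULL,
    lease_expires_at REAL NOT NULL
)
"""

def try_claim_leadership(conn, my_address, now, lease_secs=10.0):
    """Claim or renew leadership; return True if we hold the lease afterwards."""
    with conn:  # one transaction: commit on success, roll back on error
        # Insert the row if no leader has ever been recorded.
        conn.execute(
            "INSERT OR IGNORE INTO leader (id, address, lease_expires_at) "
            "VALUES (1, ?, ?)",
            (my_address, now + lease_secs),
        )
        # Take over only if the lease is already ours or has expired.
        conn.execute(
            "UPDATE leader SET address = ?, lease_expires_at = ? "
            "WHERE id = 1 AND (address = ? OR lease_expires_at < ?)",
            (my_address, now + lease_secs, my_address, now),
        )
        (holder,) = conn.execute(
            "SELECT address FROM leader WHERE id = 1"
        ).fetchone()
    return holder == my_address

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    print(try_claim_leadership(conn, "10.0.0.1:1234", now=0.0))   # first claimant
    print(try_claim_leadership(conn, "10.0.0.2:1234", now=5.0))   # lease still held
    print(try_claim_leadership(conn, "10.0.0.2:1234", now=11.0))  # lease expired
```

The return value could drive the readiness probe directly: a pod reports ready only while its claim succeeds. The stored address is also what lets a new pod find the peer it should ask to step down.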
Hiding the delay of scanning pageservers:
- If we are doing a restart of a healthy system, we could avoid the pageserver scan on startup by transferring all the in-memory state from one node to another. However, that's kind of involved: it means defining a cross-version serialization format for all the in-memory stuff.
- Another way to achieve this would be to ask the current leader to transfer only its Observed state for shards when it steps down: that way we limit the scope of what in-memory state needs transferring, and the new node can start up deterministically quickly because it doesn't have to call out to pageservers.

An example of a flow that might work:

new pod starts
new pod learns about existing leader from the database
new pod calls into existing leader to request step down. Previous leader goes into a mode where it returns 503 to all other APIs, stops any communication with pageservers/database, and sends all Observed state in its response.
On 200 from step down request, new pod writes its own details into the database, builds its in-memory state in a few milliseconds based on the contents of the database and the Observed state it receives, and sets its readiness probe state to ready.
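
The flow above can be sketched as a small in-memory simulation. Everything here is illustrative: `Controller`, `take_over`, the dict standing in for the database, and the direct method call standing in for the step-down HTTP request are all assumptions, not the storage controller's real types or API.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    address: str
    observed: dict = field(default_factory=dict)  # shard id -> observed location
    stepped_down: bool = False
    ready: bool = False  # what the readiness probe would report

    def handle_request(self):
        """Any ordinary API call: 503 unless this node is the ready leader."""
        return 200 if self.ready and not self.stepped_down else 503

    def step_down(self):
        """Stop serving and hand the Observed state to the caller."""
        self.stepped_down = True
        self.ready = False
        return 200, dict(self.observed)

def take_over(new, db, peers):
    """Run the flow from the new pod's point of view.

    db is a dict standing in for the database; peers maps address -> Controller
    and stands in for making an HTTP call to the existing leader.
    """
    old_address = db.get("leader")  # learn about the existing leader
    if old_address is not None and old_address != new.address:
        status, observed = peers[old_address].step_down()
        if status != 200:
            return False  # do not claim leadership without a clean handover
        new.observed = observed
    db["leader"] = new.address  # write our own details into the database
    new.ready = True            # readiness probe flips to ready
    return True

if __name__ == "__main__":
    old = Controller("old:1234", observed={"shard-0": "pageserver-1"}, ready=True)
    new = Controller("new:1234")
    db = {"leader": old.address}
    take_over(new, db, {old.address: old})
    print(old.handle_request(), new.handle_request(), new.observed)
```

The sketch leaves out what the new pod should do when the old leader is unreachable or refuses to step down; that case would need its own policy.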

mickael-carl commented 2 months ago

> When a controller is running but is not the leader, it may use k8s readiness probe to avoid receiving requests. Once it has successfully claimed leadership in the database (e.g. by writing itself as the leader, perhaps after making an API call to its peer to ask it to step down), it may consider itself ready.

(For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?

jcsp commented 2 months ago

> (For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?

I hadn't thought about it: currently none of our storage services has a direct dependency on kubernetes.

Looking at the https://docs.rs/kube-coordinate/0.2.1/kube_coordinate/ API, it looks quite bare-bones: it doesn't have a way to signal another leader to ask it to step down, or to find out who the leader currently is (if it's not yourself). Whether that's an issue depends on how we do traffic management: whether we are going to say a new node is "ready" before it becomes the leader (such that the leader will get a SIGTERM and step down), or if we need a way for the new node to tell the old node to step down before it can become ready.

mickael-carl commented 2 months ago

> find out who the leader currently is

That might be an artifact of that particular library, as the API does define that:

> holderIdentity (string)
>
> holderIdentity contains the identity of the holder of a current lease.

The pattern I've seen used around leaders stepping down is to basically let the lease expire, and trigger a new leader election, which may not be suitable in this case (you could set the lease expiration to a very short time as a way to "fix" that though?). Regardless, I just thought I'd mention it, happy to chat about leases some more if that's something you'd want to consider (if not, I promise I won't be offended on behalf of that API 😛 )
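
To make the "let the lease expire" pattern concrete, here is a small in-memory model. The field names mirror `holderIdentity`, `leaseDurationSeconds`, and `renewTime` from the k8s Lease spec, but the code makes no Kubernetes API calls; `Lease`, `is_expired`, and `try_acquire_or_renew` are hypothetical helpers, and times are plain floats rather than real timestamps.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Lease:
    # Mirrors holderIdentity, leaseDurationSeconds and renewTime in the Lease spec.
    holder_identity: Optional[str] = None
    lease_duration_seconds: float = 15.0
    renew_time: float = 0.0

def is_expired(lease, now):
    """A lease with no holder, or not renewed within its duration, is up for grabs."""
    if lease.holder_identity is None:
        return True
    return now > lease.renew_time + lease.lease_duration_seconds

def try_acquire_or_renew(lease, candidate, now):
    """Renew if we already hold the lease; take it over only once it has expired."""
    if lease.holder_identity == candidate or is_expired(lease, now):
        lease.holder_identity = candidate
        lease.renew_time = now
        return True
    return False

if __name__ == "__main__":
    lease = Lease(lease_duration_seconds=2.0)
    print(try_acquire_or_renew(lease, "old-pod", now=0.0))  # old pod becomes leader
    print(try_acquire_or_renew(lease, "new-pod", now=1.0))  # still held by old pod
    # old pod stops renewing (e.g. after SIGTERM) ...
    print(try_acquire_or_renew(lease, "new-pod", now=2.5))  # expired, new pod takes over
    print(lease.holder_identity)
```

With this pattern the handover gap is bounded by the lease duration, which is why a very short expiration is the suggested workaround.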

VladLazar commented 3 weeks ago

Status as of 2024-08-19:

Most of the code and tests have been merged
Did some manual testing as well

Plan for week of 2024-08-19:

Figure out db migrations
Sort out cross storcon JWT tokens
Update helm chart to give storcon IP
