neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

storage_controller: nicer split-brain handling, and reduce 503 period during restarts #7797

Open jcsp opened 4 months ago

jcsp commented 4 months ago

Goal

Reduce or eliminate the period during controller restarts (typically a few seconds) where requests from the control plane may receive 503 responses.

Background

Here's how restarts work today:

Ideas

An example of a flow that might work:

  1. new pod starts
  2. new pod learns about existing leader from the database
  3. new pod calls into the existing leader to request a step down. The previous leader enters a mode in which it returns 503 to all other APIs, stops all communication with pageservers and the database, and includes all of its Observed state in the response.
  4. On receiving a 200 from the step-down request, the new pod writes its own details into the database, builds its in-memory state within a few milliseconds from the contents of the database and the Observed state it received, and sets its readiness probe state to ready.
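
The four steps above can be sketched as a small in-process simulation. This is only an illustration under stated assumptions, not the storage controller's code: `Database`, `Controller`, `peers`, `handle_request`, `step_down`, and `start` are hypothetical names, the database is reduced to a single leader record, and the step-down HTTP call is replaced by a direct method call.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class Database:
    """Stand-in for the controller's database: holds only the leader record."""

    def __init__(self) -> None:
        self.leader: Optional[str] = None


@dataclass
class Controller:
    name: str
    db: Database
    peers: Dict[str, "Controller"]  # hypothetical registry standing in for HTTP calls
    observed: Dict[str, str] = field(default_factory=dict)
    stepped_down: bool = False
    ready: bool = False  # what the readiness probe would report

    def handle_request(self) -> int:
        """Any API other than step-down: 503 unless this is the ready leader."""
        if self.stepped_down or not self.ready:
            return 503
        return 200

    def step_down(self) -> Dict[str, str]:
        """Step 3: stop serving and hand the Observed state to the caller."""
        self.stepped_down = True
        self.ready = False
        return dict(self.observed)

    def start(self) -> None:
        # Step 2: learn about the existing leader from the database.
        leader = self.db.leader
        if leader is not None and leader != self.name:
            # Step 3: ask the existing leader to step down; keep its Observed state.
            self.observed = self.peers[leader].step_down()
        # Step 4: record ourselves as leader, then report ready.
        self.db.leader = self.name
        self.ready = True


if __name__ == "__main__":
    db = Database()
    peers: Dict[str, Controller] = {}
    old = Controller("old", db, peers, observed={"shard-0": "pageserver-1"})
    peers["old"] = old
    old.start()
    new = Controller("new", db, peers)
    peers["new"] = new
    new.start()
    print(db.leader, old.handle_request(), new.handle_request())
```

In this sketch the only window in which neither instance returns 200 is between the old leader's `step_down` and the new pod setting `ready`, which is the window the flow aims to keep to a few milliseconds.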

mickael-carl commented 2 months ago

> When a controller is running but is not the leader, it may use k8s readiness probe to avoid receiving requests. Once it has successfully claimed leadership in the database (e.g. by writing itself as the leader, perhaps after making an API call to its peer to ask it to step down), it may consider itself ready.

(For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?

jcsp commented 2 months ago

> (For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?

I hadn't thought about it: currently none of our storage services has a direct dependency on Kubernetes.

Looking at the https://docs.rs/kube-coordinate/0.2.1/kube_coordinate/ API, it looks quite bare-bones: it doesn't have a way to signal another leader to ask them to step down, or to find out who the leader currently is (if it's not yourself). Whether that's an issue depends on how we do traffic management: whether we say a new node is "ready" before it becomes the leader (such that the old leader gets a SIGTERM and steps down), or whether we need a way for the new node to tell the old node to step down before it can become ready.

mickael-carl commented 2 months ago

> find out who the leader currently is

That might be an artifact of that particular library, as the API does define that:

holderIdentity (string)

holderIdentity contains the identity of the holder of a current lease.

The pattern I've seen used around leaders stepping down is to basically let the lease expire, and trigger a new leader election, which may not be suitable in this case (you could set the lease expiration to a very short time as a way to "fix" that though?). Regardless, I just thought I'd mention it, happy to chat about leases some more if that's something you'd want to consider (if not, I promise I won't be offended on behalf of that API 😛 )
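
To make the expiry-based pattern concrete, here is a minimal in-memory sketch of a lease with a short duration. It does not talk to Kubernetes: `Lease`, `expired`, and `try_acquire` are hypothetical names, the fields only mirror the `holderIdentity`, `leaseDurationSeconds`, and `renewTime` fields of the Lease spec, and the clock is passed in by the caller. A real implementation would also have to deal with concurrent updates, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory model of a lease (an illustration, not a Kubernetes client)."""

    holder_identity: Optional[str] = None  # mirrors holderIdentity
    lease_duration_seconds: float = 2.0    # mirrors leaseDurationSeconds; short on purpose
    renew_time: float = 0.0                # mirrors renewTime, in seconds on the caller's clock

    def expired(self, now: float) -> bool:
        """A lease with no holder, or one not renewed within its duration, is free."""
        if self.holder_identity is None:
            return True
        return now - self.renew_time > self.lease_duration_seconds

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Take or renew the lease; fail while another holder's lease is live."""
        if self.holder_identity == candidate or self.expired(now):
            self.holder_identity = candidate
            self.renew_time = now
            return True
        return False


if __name__ == "__main__":
    lease = Lease(lease_duration_seconds=2.0)
    print(lease.try_acquire("old", now=0.0))  # old leader takes the lease
    print(lease.try_acquire("new", now=1.0))  # still held by "old"
    print(lease.try_acquire("new", now=5.0))  # "old" stopped renewing
    print(lease.holder_identity)
```

With this shape, a peer can read `holder_identity` to find the current leader, and the time a new leader waits after the old one stops renewing is bounded by `lease_duration_seconds`, which is the trade-off behind choosing a very short expiration.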

VladLazar commented 3 weeks ago

Status as of 2024-08-19:

Plan for week of 2024-08-19: