Open jcsp opened 4 months ago
When a controller is running but is not the leader, it may use a k8s readiness probe to avoid receiving requests. Once it has successfully claimed leadership in the database (e.g. by writing itself as the leader, perhaps after making an API call to its peer to ask it to step down), it may consider itself ready.
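To make the readiness idea concrete, here is a minimal sketch of how readiness could be tied to leadership. It is not the controller's actual code: `Controller`, `LeaderRecord`, `claim_leadership` and `step_down` are hypothetical names, the database is replaced by an in-memory struct, and the peer API call is modelled as a direct method call.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the leader record kept in the database.
#[derive(Debug, Default)]
pub struct LeaderRecord {
    pub leader: Option<String>,
}

/// Hypothetical controller instance; names are illustrative only.
pub struct Controller {
    pub id: String,
    is_leader: AtomicBool,
}

impl Controller {
    pub fn new(id: &str) -> Self {
        Controller {
            id: id.to_string(),
            is_leader: AtomicBool::new(false),
        }
    }

    /// Status code a readiness endpoint could return: 200 only while this
    /// instance is the leader, 503 otherwise, so that a readiness probe
    /// keeps traffic away from non-leaders.
    pub fn readiness_status(&self) -> u16 {
        if self.is_leader.load(Ordering::SeqCst) {
            200
        } else {
            503
        }
    }

    /// Models the "please step down" request from a peer
    /// (assumed here to be a plain method call rather than an HTTP API).
    pub fn step_down(&self) {
        self.is_leader.store(false, Ordering::SeqCst);
    }

    /// Ask the previous leader (if any) to step down, then write ourselves
    /// into the leader record, and only then start reporting ready.
    pub fn claim_leadership(&self, record: &mut LeaderRecord, previous: Option<&Controller>) {
        if let Some(peer) = previous {
            peer.step_down();
        }
        record.leader = Some(self.id.clone());
        self.is_leader.store(true, Ordering::SeqCst);
    }
}
```

In this sketch the ordering is the important part: the new instance flips to ready only after the old one has stepped down and the record has been written, so at most one instance reports ready at a time.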
(For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?
> (For my own understanding) There's the lease API in k8s for leader election, is there any reason to not use that?
I hadn't thought about it: currently none of our storage services has a direct dependency on kubernetes.
Looking at the https://docs.rs/kube-coordinate/0.2.1/kube_coordinate/ API, it looks quite bare-bones: it doesn't have a way to signal the current leader to step down, or to find out who the leader currently is (if it's not yourself). Whether that's an issue depends on how we do traffic management: whether we are going to say a new node is "ready" before it becomes the leader (such that the old leader will get a SIGTERM and step down), or whether we need a way for the new node to tell the old node to step down before it can become ready.
> find out who the leader currently is
That might be an artifact of that particular library, as the API does define that:
holderIdentity (string)
holderIdentity contains the identity of the holder of a current lease.
The pattern I've seen used around leaders stepping down is to basically let the lease expire, and trigger a new leader election, which may not be suitable in this case (you could set the lease expiration to a very short time as a way to "fix" that though?). Regardless, I just thought I'd mention it, happy to chat about leases some more if that's something you'd want to consider (if not, I promise I won't be offended on behalf of that API 😛 )
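For illustration, the expire-and-re-elect pattern can be sketched with a small in-memory model of a lease. This is not the k8s client API or the kube-coordinate API: only `holder_identity` mirrors the `holderIdentity` field quoted above, while `Lease::new`, `try_acquire`, `current_holder` and the explicit `now` parameter are assumptions made to keep the example self-contained.

```rust
use std::time::{Duration, Instant};

/// Simplified in-memory model of a lease. Only `holder_identity` mirrors a
/// field named in the thread; the rest is hypothetical.
#[derive(Debug, Clone)]
pub struct Lease {
    pub holder_identity: Option<String>,
    pub renewed_at: Instant,
    pub duration: Duration,
}

impl Lease {
    pub fn new(duration: Duration, now: Instant) -> Self {
        Lease {
            holder_identity: None,
            renewed_at: now,
            duration,
        }
    }

    /// A lease counts as expired once `duration` has passed since the last renewal.
    fn expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.renewed_at) >= self.duration
    }

    /// Any participant can read who currently holds the lease, not just the holder.
    pub fn current_holder(&self, now: Instant) -> Option<&str> {
        if self.expired(now) {
            None
        } else {
            self.holder_identity.as_deref()
        }
    }

    /// Acquire the lease if it is free or expired, or renew it if we already
    /// hold it. Returns true when `candidate` holds the lease afterwards.
    pub fn try_acquire(&mut self, candidate: &str, now: Instant) -> bool {
        let free = self.holder_identity.is_none() || self.expired(now);
        let ours = self.holder_identity.as_deref() == Some(candidate);
        if free || ours {
            self.holder_identity = Some(candidate.to_string());
            self.renewed_at = now;
            true
        } else {
            false
        }
    }
}
```

In this model a leader "steps down" simply by no longer calling `try_acquire`. A challenger then has to wait for up to `duration` before it can take over, which is why a shorter lease duration shortens the handover gap, at the cost of more frequent renewals.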
Status as of 2024-08-19:
Plan for week of 2024-08-19:
Goal
Reduce or eliminate the period during controller restarts (typically a few seconds) where requests from the control plane may receive 503 responses.
Background
Here's how restarts work today:
Ideas
An example of a flow that might work: