JoelSpeed opened 11 months ago
/triage accepted
We had some discussion about this on the sig-cloud-provider call last week; some questions came up that would be good to answer. Adding those here so we don't forget:
- If we implement `ReleaseOnCancel`, can we drop the panic that normally happens in `OnStoppedLeading`, usually via `klog.Fatalf`?
- How do we make sure consumers of the library (`app`) are aware of the requirements for their code to be compatible with `ReleaseOnCancel`?
- Does this apply to `ReleaseOnCancel` elsewhere too, and what happens if we don't have the config synced there?
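The questions above come down to how the lease is wired up. Below is a minimal sketch (assumptions labelled, not the issue's code) of what enabling `ReleaseOnCancel` might look like with client-go's `tools/leaderelection` package; the `lock`, `ctx`, and `run` identifiers are hypothetical placeholders:

```go
// Sketch only: how a controller manager might enable ReleaseOnCancel
// using k8s.io/client-go/tools/leaderelection. `lock`, `ctx`, and
// `run` are hypothetical placeholders, not from the issue.
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
	Lock:          lock, // e.g. a resourcelock.LeaseLock
	LeaseDuration: 137 * time.Second,
	RenewDeadline: 107 * time.Second,
	RetryPeriod:   26 * time.Second,
	// Release the Lease when ctx is cancelled, so a successor can
	// acquire it immediately instead of waiting for it to expire.
	ReleaseOnCancel: true,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		// Today this callback typically panics (klog.Fatalf). With
		// ReleaseOnCancel it also fires on a graceful release, so a
		// clean shutdown path must not treat it as fatal.
		OnStoppedLeading: func() {
			klog.Info("leader lease lost or released")
		},
	},
})
```

This is where the first question bites: once a graceful release also triggers `OnStoppedLeading`, the callback can no longer unconditionally `klog.Fatalf`.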
What happened?
When using the cloud controller manager libraries (`k8s.io/cloud-provider`) to build out a cloud controller manager, the libraries configure leader election. They expose flags to configure this leader election, and, when these values are long, they can cause downtime when the pod restarts.
Other controller managers have the ability to release the leader election lease when they know they are shutting down, which minimises the time taken for a new leader to take over.
With the CCM, we have been configuring lease durations of 137s (some maths went into this but it doesn't matter right now), and so, in the worst case, we are seeing over 2 minutes of downtime when the pods are updated and switched over.
This is important to fix because, when the CCM isn't running, new nodes cannot be initialized and load balancer membership cannot be updated.
What did you expect to happen?
The controller manager, on shutdown, should release the leader lease by enabling the `ReleaseOnCancel` option in the leader election config.

How can we reproduce it (as minimally and precisely as possible)?
Run any cloud controller manager that uses the cloud-provider library (GCP or AWS are good candidates) and set the leader election config to `--leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s`. Then change the deployment to create a rollout. During the rollout, you will notice it takes a long time for the leader election lease to be granted to the new pods.

Anything else we need to know?
This was discussed on the sig-cloud-provider call on the 5th of July, 2023, and it was decided that this is something we should aim to fix, since the kube-controller-manager is also working towards this goal.

/sig cloud-provider
Kubernetes version
Any
Cloud provider
AWS, GCP, any that uses `k8s.io/cloud-provider/app`
OS version
N/A
Install tools
N/A
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, ...) and versions (if applicable)
N/A