kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Cloud Controller Managers do not release leader election when shutting down #119905

Open JoelSpeed opened 11 months ago

JoelSpeed commented 11 months ago

What happened?

When using the cloud controller manager libraries k8s.io/cloud-provider to build out a cloud controller manager, the libraries configure leader election.

They expose flags to configure this leader election, and when these durations are configured to be long, a pod restart can cause extended downtime.

Other controller managers have the ability to release the leader election lease when they know they are shutting down, which minimises the time taken for a new leader to take over.

With the CCM, we have been configuring lease durations of 137s (some maths went into this but it doesn't matter right now). Because a new candidate must wait for the old lease to expire before it can acquire it, in the worst case we are seeing over 2 minutes of downtime when the pods are updated and switched over.

This is important to fix because, when the CCM isn't running, new nodes cannot be initialized and load balancer membership cannot be updated.

What did you expect to happen?

On shutdown, the controller manager should release the leader lease by setting the ReleaseOnCancel option in its leader election configuration.
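For illustration, here is a minimal sketch of how client-go's leader election can be asked to release the lease on shutdown. This is not the cloud-provider library's actual wiring; the lease name, namespace, and the 137s/107s/26s values are placeholders mirroring the numbers above.

```go
// Minimal sketch of leader election with ReleaseOnCancel, using client-go.
// The lease name, namespace, and identity below are placeholders.
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Cancel the context on SIGTERM so the elector can release the lease
	// before the pod exits.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "cloud-controller-manager", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		// Release the lease (clear the holder identity) when ctx is cancelled,
		// instead of leaving it to sit until LeaseDuration expires.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the controllers here; they should stop when ctx is cancelled.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Lost or released the lease; exit so the new pod can take over.
			},
		},
	})
}
```

With ReleaseOnCancel set, the elector clears the holder identity when the context is cancelled, so a waiting candidate can acquire the lease after roughly one retry period instead of waiting out the full lease duration.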

How can we reproduce it (as minimally and precisely as possible)?

Run any cloud controller manager that uses the cloud-provider library (GCP or AWS are good candidates) and set the leader election flags to --leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s. Then change the deployment to trigger a rollout. During the rollout, you will notice it takes a long time for the leader election lease to be granted to the new pod.
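To observe the handover delay during the rollout, a small watcher like the sketch below can poll the lease and print the current holder; the lease name and namespace are assumptions and may differ per provider.

```go
// Watch who holds the CCM lease during the rollout (sketch; names are placeholders).
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		lease, err := client.CoordinationV1().Leases("kube-system").
			Get(context.TODO(), "cloud-controller-manager", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		holder := ""
		if lease.Spec.HolderIdentity != nil {
			holder = *lease.Spec.HolderIdentity
		}
		fmt.Printf("%s holder=%q renewed=%v\n", time.Now().Format(time.RFC3339), holder, lease.Spec.RenewTime)
		time.Sleep(5 * time.Second)
	}
}
```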

Anything else we need to know?

This was discussed on the sig-cloud-provider call on the 5th of July, 2023, and it was decided that this is something we should aim to fix, since the kube-controller-manager is also working towards this goal.

/sig cloud-provider

Kubernetes version

Any

Cloud provider

AWS, GCP, any that uses k8s.io/cloud-provider/app

OS version

N/A

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

bridgetkromhout commented 11 months ago

/triage accepted

JoelSpeed commented 11 months ago

We had some discussion about this on the sig-cloud-provider call last week, and some questions came up that would be good to answer. Adding those here so we don't forget: