Open engedaam opened 4 weeks ago
/triage accepted
We just need to validate that we aren't leaking leases here, right? Through Prom metrics in our soak testing and E2E testing? And then we should be good to confirm and remove this controller?
/retitle Remove the Lease garbage collection controller
@jonathan-innis Yeah, the main work needed here is to gather metrics showing whether any node leases are leaked, so we can confirm it is safe to remove the controller.
Description
What problem are you trying to solve?
When Karpenter deleted a Node object while the kubelet was still alive, the kubelet's lease logic ignored errors and created a lease without an ownerReference set: https://github.com/kubernetes/kubernetes/issues/109777.
Karpenter used to remove its finalizers from the NodeClaim and Node without waiting for the underlying VM/kubelet to be fully terminated. This resulted in node leases being leaked into the cluster, as the terminating kubelet would create a phantom lease prior to deletion: https://github.com/aws/karpenter-provider-aws/issues/4363.
As a mitigation, the Karpenter team implemented a lease garbage collection controller to delete any leaked node leases: https://github.com/kubernetes-sigs/karpenter/pull/471
The team recently moved to waiting for the underlying VMs to be fully terminated prior to removing the NodeClaim and Node finalizers, which should prevent a terminating kubelet from creating phantom leases: https://github.com/kubernetes-sigs/karpenter/pull/1195
We will need to validate that waiting for instance termination results in Karpenter no longer leaking node leases.
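As a rough illustration of what such a validation could check, here is a minimal Go sketch that flags leaked node leases. It assumes a leaked lease is one in the `kube-node-lease` namespace with no ownerReference, as described above. The `Lease` struct and the `leakedLeases` function are hypothetical, simplified stand-ins for illustration only; they are not the client-go types or Karpenter's actual controller code.

```go
package main

// nodeLeaseNamespace is the namespace where kubelets keep their node leases.
const nodeLeaseNamespace = "kube-node-lease"

// Lease is a hypothetical, simplified stand-in for a coordination.k8s.io/v1
// Lease. Only the fields needed for this check are modeled.
type Lease struct {
	Name      string
	Namespace string
	OwnerRefs []string // names of owning objects; empty when no ownerReference is set
}

// leakedLeases returns the names of leases that live in the node lease
// namespace but have no ownerReference. This mirrors the leak condition
// described in this issue; it is an assumption, not Karpenter's exact logic.
func leakedLeases(leases []Lease) []string {
	var leaked []string
	for _, l := range leases {
		if l.Namespace != nodeLeaseNamespace {
			continue // not a node lease, ignore
		}
		if len(l.OwnerRefs) == 0 {
			leaked = append(leaked, l.Name)
		}
	}
	return leaked
}
```

In a soak or E2E test, the length of the returned slice could be exported as a gauge; a value that stays at zero while nodes churn would support removing the controller.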