Open bradbehle opened 9 months ago
kubectl get ipamhandle output:
calico-kube-controllers pod log:
Thanks @bradbehle - from what I can see in those files, the vast majority of the IPAM handles are from over 2 years ago. I suspect what has happened is that those IPAM handles were leaked before we had implemented many of the garbage collection improvements in more recent releases.
The kube-controllers code itself doesn't collect stray handles if the IP addresses associated with them don't exist, which is exactly the state this cluster is in. This is obviously a limitation, but in normal operation the handle and IP are released as close to atomically as the k8s API allows (and the handle is deleted before the allocation), so you wouldn't see this state.
If you can confirm that the leaked handles are all old, and that there are no new leaks occurring, then I think it would be safe to clean the old handles up using calicoctl and chalk this up to older versions of Calico leaving cruft behind.
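For reference, a rough sketch of what that calicoctl cleanup can look like on recent releases; the exact flags and any need to quiesce the cluster first should be verified against the docs for your version:

```
# Sketch of a calicoctl-based IPAM cleanup, assuming a recent calicoctl (v3.24-ish);
# flags and report format may differ between versions.

# 1. Run an IPAM consistency check and write the findings to a report file.
calicoctl ipam check -o ipam-report.json

# 2. Review the report, then release the leaked allocations/handles it identified.
#    Only run this once you've confirmed no new leaks are occurring.
calicoctl ipam release --from-report=ipam-report.json
```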
There is probably a pretty good case to be made for kube-controllers cleaning up handles with no allocation the same way that it cleans up allocations themselves.
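One rough, read-only way to see which handles are in that state (a handle with no backing allocation) is to cross-check IPAMHandles against the handle IDs recorded in IPAMBlock attributes. This is only a sketch and assumes the Kubernetes datastore driver (CRDs), jq, and the snake_case handle_id field in the IPAMBlock spec:

```
# Sketch: list IPAM handles that no IPAMBlock allocation references (read-only).
# Assumes the Kubernetes datastore driver and jq.
kubectl get ipamblocks.crd.projectcalico.org -o json \
  | jq -r '.items[].spec.attributes[]?.handle_id // empty' | sort -u > /tmp/referenced-handles

kubectl get ipamhandles.crd.projectcalico.org -o json \
  | jq -r '.items[].spec.handleID' | sort -u > /tmp/all-handles

# Handle IDs present in the second list but not the first have no backing allocation.
comm -23 /tmp/all-handles /tmp/referenced-handles
```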
Thanks for looking into this, that explains it - we can confirm that all leaked handles are old.
I think the details about the Calico fix for this are here: https://github.com/projectcalico/calico/issues/6988#issuecomment-1331016577
With calico v3.24.5 I did the cleanup procedure and it removed the stale handles with no matching IPs, so this can probably be closed.
@rptaylor that fix improved the calicoctl IPAM cleanup code to release the handles, but this issue is more about automated GC that doesn't require intervention via calicoctl.
Expected Behavior
calico-kube-controllers should clean up leaked pod IPs / IPAMHandles
Current Behavior
In at least the cluster we are looking at, calico-kube-controllers is not cleaning these up
Possible Solution
Figure out why calico-kube-controllers doesn't seem to notice these ~30,000 leaked IPAM handles, but calicoctl ipam check does (see details below).
Steps to Reproduce (for bugs)
Context
We don't think this is causing any noticeable problems with the cluster at the moment, but all of these leaked CRDs can't be helping etcd performance. We could probably run calicoctl ipam release and clean these up. The concern is that this is probably also a problem on a number of older clusters we maintain and could eventually become a problem, so we were hoping that if we provide information about this cluster, someone could determine why calico-kube-controllers isn't cleaning these up and fix it in an upcoming Calico release.
Your Environment
Here's the information that shows all the leaked IPAM handles (with IP addresses obscured):
Here are the object counts in etcd to confirm the large number:
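(For anyone wanting to reproduce a count like this, a rough sketch, assuming direct etcdctl v3 access to the cluster's etcd and the usual /registry/<group>/<plural>/ layout for custom resources:)

```
# Sketch: count IPAMHandle objects directly in etcd.
# The registry prefix is an assumption based on how CRs are normally stored.
ETCDCTL_API=3 etcdctl get /registry/crd.projectcalico.org/ipamhandles/ \
  --prefix --keys-only | grep -c ipamhandles
```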
I've also attached the output of get ipamhandles.crd.projectcalico.org -o wide and get ipamhandles.crd.projectcalico.org -o custom-columns=.:.spec, which shows that almost all the CRDs are over 2 years old. I also included the calico-kube-controllers pod log, which shows that not much cleanup is happening. Please let me know if you would like more information about this cluster.
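For reference, those listings (plus a variant sorted by age) can be gathered with something like the following sketch of the commands referenced above:

```
# List all IPAMHandle CRDs, oldest first, to confirm that most are over 2 years old.
kubectl get ipamhandles.crd.projectcalico.org -o wide \
  --sort-by=.metadata.creationTimestamp

# Dump just the spec of every handle (the handle ID and the block it points at).
kubectl get ipamhandles.crd.projectcalico.org -o custom-columns=.:.spec
```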