mahmoudghonimfinout opened this issue 1 month ago
@mahmoudghonimfinout thanks for raising this to our attention. Did you notice whether restarting calico-kube-controllers temporarily fixes the issue, or is the only resolution to manually clean up the leaked resources?
@caseydavenport thank you for looking into this. Restarting calico-kube-controllers did not solve the issue; only manual cleanup of the leaked resources did.
Hi @mahmoudghonimfinout, it's possible that if your kube-controllers pod is running on a node that gets deleted frequently and is then restarted on another node, it does not have enough time to release the unused IPs. The default leakGracePeriod is 15 minutes, so if kube-controllers runs for less than that, it never gets a chance to release the IPs after it has identified the leak. You could try moving the pod to a control-plane node and also lowering leakGracePeriod: https://docs.tigera.io/calico/latest/reference/resources/kubecontrollersconfig. If that does not solve your issue, could you please share more info with us, namely: the list of pods, block affinities, ipamblocks, the kube-controllers logs after a kube-controllers restart, and the output of calicoctl ipam check.
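As a rough illustration only (not a recipe tuned to this cluster), lowering the grace period on the default KubeControllersConfiguration with calicoctl might look like the sketch below; the 5m value is purely an example, and the field path should be checked against the docs linked above for your Calico version.

```bash
# Sketch: shorten the IP leak grace period on the default KubeControllersConfiguration.
# The 5m value is only an example; pick whatever fits your restart/churn rate.
calicoctl patch kubecontrollersconfiguration default --patch \
  '{"spec": {"controllers": {"node": {"leakGracePeriod": "5m"}}}}'

# Confirm the change took effect.
calicoctl get kubecontrollersconfiguration default -o yaml
```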
Hi @MichalFupso, in our case we observed that the kube-controllers pod was restarting, though less frequently than every 15 minutes (closer to every few hours). To address this, we migrated kube-controllers to a dedicated control-plane node and implemented an additional garbage collection job to clear unattached affinity blocks, handles, and IPAM blocks (a sketch of that kind of cleanup is shown below). Since then, we haven't seen the issue recur. However, without understanding the root cause, we're a bit concerned about how this setup will perform at scale. Our setup is straightforward, as described in the initial comment. Any insight into whether this could stem from a bug or a potential misconfiguration would be helpful.
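For anyone hitting the same symptom, a cleanup job along these lines could be built on calicoctl's IPAM checker rather than deleting CRDs by hand. The following is a minimal sketch, assuming calicoctl is configured with datastore access; the report path is hypothetical and the exact flags should be verified against your Calico version before automating anything.

```bash
#!/usr/bin/env bash
# Rough sketch of a periodic IPAM cleanup; illustrative only.
set -euo pipefail

REPORT=/tmp/ipam-report.json   # hypothetical path for the check report

# Generate a report of IPAM problems (leaked handles, blocks without backing pods, etc.).
calicoctl ipam check --show-problem-ips -o "${REPORT}"

# Release the leaked allocations identified in the report.
# The report should be fresh, since IPAM state may have changed after it was generated.
calicoctl ipam release --from-report="${REPORT}"
```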
@mahmoudghonimfinout Would you be able to share the logs mentioned above? That would help us understand the issue better.
I only have the calico-kube-controllers log saved from the incident: calico-kube-controllers.log
Environment:
Installation Configuration:
Problem Description:
At some point, the garbage collection (GC) in Calico stops working, resulting in the accumulation of stale resources. This leads to the following issues:
Steps to Reproduce:
Expected Behavior:
The Calico GC process should automatically clean up stale IPAM blocks, block affinities, and handles, ensuring resources are not exhausted.
Actual Behavior:
The GC process appears to stop working, causing the buildup of stale IPAM resources. This leads to DNS resolution errors on pods, requiring manual intervention to delete these stale resources.
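To confirm whether stale IPAM resources are actually accumulating, the Calico IPAM objects can be counted directly. The commands below are a sketch, assuming Calico is installed with its CRDs under crd.projectcalico.org and calicoctl is available:

```bash
# A steadily growing count while pod numbers stay stable suggests leaked
# resources rather than legitimate allocations.
kubectl get ipamblocks.crd.projectcalico.org -o name | wc -l
kubectl get blockaffinities.crd.projectcalico.org -o name | wc -l
kubectl get ipamhandles.crd.projectcalico.org -o name | wc -l

# calicoctl's IPAM checker reports leaked handles and blocks explicitly.
calicoctl ipam check --show-problem-ips
```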
Additional Information:
Calico-kube-controllers log:
The log file is flooded with these entries -