projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

GC Stops Working, Leading to High IPAM Resource Usage #9371

Open mahmoudghonimfinout opened 1 month ago

mahmoudghonimfinout commented 1 month ago

Environment:

Installation Configuration:

kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  kubernetesProvider: EKS
  cni:
    type: Calico
  registry: quay.io
  calicoNetwork:
    bgp: Disabled
    ipPools:
      - cidr: 172.17.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled

Problem Description:

At some point, the garbage collection (GC) in Calico stops working, resulting in the accumulation of stale IPAM resources (blocks, block affinities, and handles).

Steps to Reproduce:

  1. Install Calico on an EKS cluster using the configuration provided above.
  2. Run workloads on the cluster over an extended period.
  3. Ensure that the environment has a high rotation of nodes (frequent addition and removal of nodes, e.g. spot instances).
  4. Observe the accumulation of IPAM blocks and handles.

Expected Behavior:

The Calico GC process should automatically clean up stale IPAM blocks, block affinities, and handles, ensuring resources are not exhausted.

Actual Behavior:

The GC process appears to stop working, causing the buildup of stale IPAM resources. This leads to DNS resolution errors on pods, requiring manual intervention to delete these stale resources.

Additional Information:

Calico-kube-controllers log:

The log file is flooded with entries like the following:

ipam.go 1131: Calico Node referenced in IPAM data does not exist error=resource does not exist: 
ipam.go 1032: Garbage collecting leaked IP address handle=

caseydavenport commented 1 month ago

@mahmoudghonimfinout thanks for raising this to our attention. Did you notice if restarting calico-kube-controllers temporarily fixes the issue? Or is the only resolution to manually clean up the leaked resources?

mahmoudghonimfinout commented 1 month ago

@caseydavenport thank you for looking into this. Restarting calico-kube-controllers did not solve the issue; only manual cleanup of the leaked resources did.

MichalFupso commented 2 weeks ago

Hi @mahmoudghonimfinout, it's possible that if your kube-controllers pod is running on a node that gets deleted frequently and is then restarted on another node, it does not have enough time to release the unused IPs. The default leakGracePeriod is 15 minutes, so if your kube-controllers runs for less than that, it does not get a chance to release the IPs after it has identified the leak. You could try moving the pod to a control node and also updating the leakGracePeriod to a smaller value; see https://docs.tigera.io/calico/latest/reference/resources/kubecontrollersconfig for details. If that does not solve your issue, could you please share more info with us, namely: the list of pods, block affinities, ipamblocks, kube-controllers logs after a kube-controllers restart, and the output of calicoctl ipam check.
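
For reference, a minimal sketch of the KubeControllersConfiguration change suggested above, assuming the default resource name; leakGracePeriod sits under spec.controllers.node, and the 5m value here is only an example of a shorter window that gives a short-lived kube-controllers pod a chance to release IPs it has flagged as leaked:

kind: KubeControllersConfiguration
apiVersion: projectcalico.org/v3
metadata:
  name: default
spec:
  controllers:
    node:
      # Shorter grace period (default is 15m) so leaked IPs identified by
      # kube-controllers are released before the pod itself is rescheduled.
      leakGracePeriod: 5m

This can be applied with calicoctl apply -f, or with kubectl if the Calico API server is installed.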

mahmoudghonimfinout commented 1 week ago

Hi @MichalFupso, in our case we observed that the kube-controller was restarting, though less frequently than every 15 minutes (closer to every few hours). To address this, we migrated kube-controllers to a dedicated control node and implemented an additional garbage collection job to clear unattached block affinities, handles, and IPAM blocks. Since then, we haven’t seen the issue reoccur. However, without understanding the root cause, we’re a bit concerned about how this setup will perform at scale. Our setup is straightforward, as described in the initial comment. Any insights on whether this could stem from a bug or a potential misconfiguration would be helpful.
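
For completeness, a minimal sketch of one way to keep calico-kube-controllers on a stable (non-spot) node group under the Tigera operator, using the Installation resource's controlPlaneNodeSelector and controlPlaneTolerations fields; the node-group label and dedicated taint are placeholders and assume such a node group already exists:

kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  # ...existing kubernetesProvider / cni / calicoNetwork settings as above...
  # Places operator-managed components such as calico-kube-controllers on a
  # stable node group; the label and toleration values are illustrative only.
  controlPlaneNodeSelector:
    node-group: stable
  controlPlaneTolerations:
    - key: dedicated
      operator: Equal
      value: stable
      effect: NoSchedule

A periodic cleanup job like the one described above could be built around calicoctl ipam check and calicoctl ipam release, but the exact flags and required permissions depend on the calicoctl version, so treat that as an assumption rather than a drop-in recipe.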

MichalFupso commented 1 week ago

@mahmoudghonimfinout Would you be able to share the logs mentioned above? That would help us understand this better.

mahmoudghonimfinout commented 1 day ago

I only have the calico-kube-controllers log saved from the time of the issue: calico-kube-controllers.log