projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

GC Stops Working, Leading to High IPAM Resource Usage #9371

Open mahmoudghonimfinout opened 1 month ago

mahmoudghonimfinout commented 1 month ago

Environment:

Installation Configuration:

kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  kubernetesProvider: EKS
  cni:
    type: Calico
  registry: quay.io
  calicoNetwork:
    bgp: Disabled
    ipPools:
      - cidr: 172.17.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled

Problem Description:

At some point, the garbage collection (GC) in Calico stops working, resulting in the accumulation of stale IPAM resources (blocks, block affinities, and handles).

Steps to Reproduce:

  1. Install Calico on an EKS cluster using the configuration provided above.
  2. Run workloads on the cluster over an extended period.
  3. Ensure that the environment has a high rotation of nodes (frequent addition and removal of nodes, e.g. spot instances).
  4. Observe the accumulation of IPAM blocks and handles.

Expected Behavior:

The Calico GC process should automatically clean up stale IPAM blocks, block affinities, and handles, ensuring resources are not exhausted.

Actual Behavior:

The GC process appears to stop working, causing the buildup of stale IPAM resources. This leads to DNS resolution errors on pods, requiring manual intervention to delete these stale resources.

Additional Information:

Calico-kube-controllers log:

The log file is flooded with entries like the following:

ipam.go 1131: Calico Node referenced in IPAM data does not exist error=resource does not exist: 
ipam.go 1032: Garbage collecting leaked IP address handle=

caseydavenport commented 1 month ago

@mahmoudghonimfinout thanks for raising this to our attention. Did you notice if restarting calico-kube-controllers temporarily fixes the issue? Or is the only resolution to manually clean up the leaked resources?

mahmoudghonimfinout commented 1 month ago

@caseydavenport thank you for looking into this. Restarting calico-kube-controllers did not solve the issue; only manual cleanup of the leaked resources did.

MichalFupso commented 2 weeks ago

Hi @mahmoudghonimfinout, it's possible that if your kube-controllers pod is running on a node that gets deleted frequently and is then restarted on another node, it does not have enough time to release the unused IPs. The default leakGracePeriod is 15 minutes, so if your kube-controllers runs for less than that, it does not get a chance to release the IPs after it has identified the leak. You could try moving the pod to a control node and also updating the leakGracePeriod to a smaller value; see https://docs.tigera.io/calico/latest/reference/resources/kubecontrollersconfig for details. If that does not solve your issue, could you please share more info with us, namely: the list of pods, block affinities, ipamblocks, kube-controllers logs after a kube-controllers restart, and the output of calicoctl ipam check.
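
For reference, a minimal sketch of the KubeControllersConfiguration change suggested above, assuming the default resource name; leakGracePeriod sits under spec.controllers.node, and the 5m value here is only an example of a shorter window that gives a short-lived kube-controllers pod a chance to release IPs it has flagged as leaked:

kind: KubeControllersConfiguration
apiVersion: projectcalico.org/v3
metadata:
  name: default
spec:
  controllers:
    node:
      # Shorter grace period (default is 15m) so leaked IPs identified by
      # kube-controllers are released before the pod itself is rescheduled.
      leakGracePeriod: 5m

This can be applied with calicoctl apply -f, or with kubectl if the Calico API server is installed.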

mahmoudghonimfinout commented 1 week ago

Hi @MichalFupso, in our case we observed that the kube-controller was restarting, though less frequently than every 15 minutes (closer to every few hours). To address this, we migrated kube-controllers to a dedicated control node and implemented an additional garbage collection job to clear unattached block affinities, handles, and IPAM blocks. Since then, we haven’t seen the issue reoccur. However, without understanding the root cause, we’re a bit concerned about how this setup will perform at scale. Our setup is straightforward, as described in the initial comment. Any insights on whether this could stem from a bug or a potential misconfiguration would be helpful.
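
For completeness, a minimal sketch of one way to keep calico-kube-controllers on a stable (non-spot) node group under the Tigera operator, using the Installation resource's controlPlaneNodeSelector and controlPlaneTolerations fields; the node-group label and dedicated taint are placeholders and assume such a node group already exists:

kind: Installation
apiVersion: operator.tigera.io/v1
metadata:
  name: default
spec:
  # ...existing kubernetesProvider / cni / calicoNetwork settings as above...
  # Places operator-managed components such as calico-kube-controllers on a
  # stable node group; the label and toleration values are illustrative only.
  controlPlaneNodeSelector:
    node-group: stable
  controlPlaneTolerations:
    - key: dedicated
      operator: Equal
      value: stable
      effect: NoSchedule

A periodic cleanup job like the one described above could be built around calicoctl ipam check and calicoctl ipam release, but the exact flags and required permissions depend on the calicoctl version, so treat that as an assumption rather than a drop-in recipe.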

MichalFupso commented 1 week ago

@mahmoudghonimfinout Would you be able to share the logs mentioned above? That would help us understand this better.

mahmoudghonimfinout commented 1 day ago

I only have the calico-kube-controllers log saved from the time of the issue: calico-kube-controllers.log