projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico freezing IPAMBlock cleanup #8643

Open cmotta2016 opened 6 months ago

cmotta2016 commented 6 months ago

We're experiencing intermittent "freezing" of IPAMBlock cleanup in a specific cluster.
The following image shows what happens (screenshot: calico-issue):

The blue line represents the total amount of nodes in the cluster. The orange ones represent the total IPAMBlocks.

As you can see, for some reason Calico stops deleting IPAMBlocks after nodes are scaled down. After manually removing the stale blocks and doing a rollout restart of the calico-kube-controllers deployment, the situation normalizes for some time.
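Concretely, the manual intervention looks roughly like this (a sketch; the block name is the example from below, and the kube-system namespace depends on how Calico is installed):

# Delete the stale IPAMBlock that kube-controllers refuses to clean up
kubectl delete ipamblocks.crd.projectcalico.org 100-120-192-64-26

# Restart kube-controllers so it rebuilds its IPAM state
kubectl -n kube-system rollout restart deployment/calico-kube-controllers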

The calico-kube-controllers logs show the following snippet:

2024-03-22 12:39:08.269 [INFO][1] resources.go 350: Main client watcher loop
2024-03-22 12:42:16.998 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-08875d255d660e8b2"
2024-03-22 12:42:17.157 [INFO][1] ipam_block_reader_writer.go 334: Block must be empty but is not empty, refusing to remove affinity. host="i-08875d255d660e8b2" subnet="100.120.192.64/26"
2024-03-22 12:42:17.307 [ERROR][1] ipam.go 1061: Failed to release block affinities for node calicoNode="i-08875d255d660e8b2" error=block '100.120.192.64/26' is not empty
2024-03-22 12:42:17.307 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.120.192.64/26' is not empty node="i-08875d255d660e8b2"

The respective IPAMBlock yaml:

apiVersion: crd.projectcalico.org/v1
kind: IPAMBlock
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":null}'
  creationTimestamp: "2024-03-21T21:32:27Z"
  generation: 26
  name: 100-120-192-64-26
  resourceVersion: "1342708201"
  uid: 4729e0f1-4533-4ae8-b0ed-b781961079f0
spec:
  affinity: host:i-08875d255d660e8b2
  allocations:
  - 0
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  attributes:
  - handle_id: ipip-tunnel-addr-i-08875d255d660e8b2
    secondary:
      node: i-08875d255d660e8b2
      type: ipipTunnelAddress
  cidr: 100.120.192.64/26
  deleted: false
  sequenceNumber: 1711056747588873503
  sequenceNumberForAllocation:
    "0": 1711056747588873478
    "1": 1711056747588873479
    "2": 1711056747588873480
    "3": 1711056747588873481
    "4": 1711056747588873482
    "5": 1711056747588873483
    "6": 1711056747588873484
    "7": 1711056747588873486
    "8": 1711056747588873488
    "9": 1711056747588873491
    "10": 1711056747588873495
    "11": 1711056747588873497
    "12": 1711056747588873499
  strictAffinity: false
  unallocated:
  - 13
  - 14
  - 15
  - 16
  - 17
  - 18
  - 19
  - 20
  - 21
  - 22
  - 23
  - 24
  - 25
  - 26
  - 27
  - 28
  - 29
  - 30
  - 31
  - 32
  - 33
  - 34
  - 35
  - 36
  - 37
  - 38
  - 39
  - 40
  - 41
  - 42
  - 43
  - 44
  - 45
  - 46
  - 47
  - 48
  - 49
  - 50
  - 51
  - 52
  - 53
  - 54
  - 55
  - 56
  - 57
  - 58
  - 59
  - 60
  - 61
  - 62
  - 63
  - 6
  - 7
  - 5
  - 8
  - 1
  - 3
  - 4
  - 2
  - 9
  - 11
  - 10
  - 12

The node that no longer exists:

  k get node i-08875d255d660e8b2 -o yaml 
Error from server (NotFound): nodes "i-08875d255d660e8b2" not found

There seems to be a strong correlation between the scaling speed (up and down) of the cluster and Calico freezing.
We found this issue that reports behavior very similar to ours.
We also changed our kubecontrollersconfigurations in an attempt to speed up the cleanup process, but were unsuccessful.

...
spec:
  controllers:
    namespace:
      reconcilerPeriod: 30s
    node:
      leakGracePeriod: 30s
      reconcilerPeriod: 30s
      syncLabels: Enabled
    policy:
      reconcilerPeriod: 30s
    serviceAccount:
      reconcilerPeriod: 30s
    workloadEndpoint:
      reconcilerPeriod: 30s
  etcdV3CompactionPeriod: 10m0s
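For anyone wanting to try the same tweak, it can be applied with something along these lines (a sketch using calicoctl patch against the default KubeControllersConfiguration; the exact tooling depends on how the cluster is managed):

calicoctl patch kubecontrollersconfiguration default --patch \
  '{"spec":{"controllers":{"node":{"reconcilerPeriod":"30s","leakGracePeriod":"30s"}}}}'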

Expected Behavior

IPAMBlocks are released at the same speed as nodes are removed (avoiding freezing).

Current Behavior

IPAM entries aren't released.

Possible Solution

I honestly don't know 😢

Steps to Reproduce (for bugs)

Same as in this comment.

Context

Because IPAMBlocks aren't released, the number of blocks only increases, requiring manual intervention to remove unused blocks; otherwise the pool of available blocks will run out and the entire cluster will stop scaling.
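A quick way to watch for this divergence (a sketch, assuming the Kubernetes datastore, where the blocks are exposed as the ipamblocks CRD):

# Compare the number of IPAM blocks with the number of nodes
kubectl get ipamblocks.crd.projectcalico.org --no-headers | wc -l
kubectl get nodes --no-headers | wc -l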

Your Environment

frozenprocess commented 6 months ago

@caseydavenport

caseydavenport commented 5 months ago

2024-03-22 12:42:17.307 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.120.192.64/26' is not empty node="i-08875d255d660e8b2"

The block YAML you included (thank you) confirms that there is indeed an allocation remaining:

  - handle_id: ipip-tunnel-addr-i-08875d255d660e8b2
    secondary:
      node: i-08875d255d660e8b2
      type: ipipTunnelAddress

The controller can't clean the block up if the IP is still present. The question then becomes "Why is the IP still allocated if the node is gone?"

The same pod handles cleaning up tunnel IP addresses when nodes are deleted. Full logs from kube-controllers would be helpful here, as well as the output of kubectl get ipamhandles, to see whether the referenced handle has actually been deleted or not.

The fact that restarting kube-controllers fixes this does suggest it might be a bookkeeping error within kube-controllers somehow.
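For example, something along these lines would capture what I'm after (a sketch; the handle name follows the ipip-tunnel-addr-<node> pattern from the block above, and the kube-system namespace is an assumption):

kubectl -n kube-system logs deployment/calico-kube-controllers > kube-controllers.log
kubectl get ipamhandles.crd.projectcalico.org | grep i-08875d255d660e8b2
kubectl get ipamhandles.crd.projectcalico.org ipip-tunnel-addr-i-08875d255d660e8b2 -o yaml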

cmotta2016 commented 5 months ago

Hi @caseydavenport, thanks for your answer.
Let me bring some updates from this case and answer your questions.

We are using some Datadog metrics to help us understand and troubleshoot this problem. The screenshot below shows three graphs:

Screenshot 2024-03-26 at 13 41 57

As you can see, the last delete request was made at 21:20 (UTC-3). (Interestingly, just a few minutes earlier the CPU usage of kube-controllers dropped drastically, but I couldn't identify why.)

The kube-controllers logs (kubectl logs -f ....) were frozen at:

2024-03-26 12:18:15.922 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0758f883008706a4a"
2024-03-26 12:18:16.187 [INFO][1] ipam_block_reader_writer.go 334: Block must be empty but is not empty, refusing to remove affinity. host="i-0758f883008706a4a" subnet="100.110.244.128/26"
2024-03-26 12:18:16.429 [ERROR][1] ipam.go 1061: Failed to release block affinities for node calicoNode="i-0758f883008706a4a" error=block '100.110.244.128/26' is not empty
2024-03-26 12:18:16.429 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.110.244.128/26' is not empty node="i-0758f883008706a4a"

So, after restarting the controller (kubectl rollout restart deployment/calico-kube-controllers), we saw the following logs and behavior (I have removed many lines from the examples below because they are practically identical; here is the full log):

2024-03-26 12:27:17.109 [INFO][1] watchersyncer.go 130: Sending status update Status=resync
2024-03-26 12:27:17.109 [INFO][1] syncer.go 86: Node controller syncer status updated: resync
2024-03-26 12:27:17.109 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-03-26 12:27:17.109 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-03-26 12:27:17.110 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-03-26 12:27:17.168 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-03-26 12:27:17.168 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-03-26 12:27:17.390 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
I0326 12:27:17.407996       1 shared_informer.go:259] Caches are synced for nodes
I0326 12:27:17.408005       1 shared_informer.go:252] Waiting for caches to sync for pods
I0326 12:27:20.208692       1 shared_informer.go:259] Caches are synced for pods
2024-03-26 12:27:20.208 [INFO][1] ipam.go 241: Will run periodic IPAM sync every 15s
2024-03-26 12:27:20.209 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-03-26 12:27:20.209 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2024-03-26 12:27:20.216 [INFO][1] watchersyncer.go 130: Sending status update Status=in-sync
2024-03-26 12:27:20.216 [INFO][1] syncer.go 86: Node controller syncer status updated: in-sync
2024-03-26 12:27:20.224 [INFO][1] hostendpoints.go 173: successfully synced all hostendpoints
2024-03-26 12:27:20.236 [INFO][1] ipam.go 305: Syncer is InSync, kicking sync channel status=in-sync
2024-03-26 12:27:20.239 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0516e099b8aba8cf2) with error: nodes "i-0516e099b8aba8cf2" not found
2024-03-26 12:27:20.239 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-0516e099b8aba8cf2" ip="100.126.81.0" node="i-0516e099b8aba8cf2"
2024-03-26 12:27:20.241 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-021c7d7278087c5c7) with error: nodes "i-021c7d7278087c5c7" not found
2024-03-26 12:27:20.241 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-021c7d7278087c5c7" ip="100.108.60.192" node="i-021c7d7278087c5c7"
2024-03-26 12:27:20.244 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-016f1917477a45e6a) with error: nodes "i-016f1917477a45e6a" not found
2024-03-26 12:27:20.244 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-016f1917477a45e6a" ip="100.113.57.128" node="i-016f1917477a45e6a"
... <hundreds of these log lines>
2024-03-26 12:30:15.242 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="ipip-tunnel-addr-i-0b28de410f20c813e" ip="100.126.188.64" node="i-0b28de410f20c813e"
2024-03-26 12:30:15.250 [INFO][1] ipam.go 1141: Updating assignments in block cidr=100.126.188.64/26
2024-03-26 12:30:15.439 [INFO][1] ipam.go 1966: Node doesn't exist, no need to release affinity cidr=100.126.188.64/26 host="i-0b28de410f20c813e"
2024-03-26 12:30:15.439 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="ipip-tunnel-addr-i-094419c6b04d05ff5" ip="100.121.216.192" node="i-094419c6b04d05ff5"
2024-03-26 12:30:15.446 [INFO][1] ipam.go 1141: Updating assignments in block cidr=100.121.216.192/26
2024-03-26 12:30:15.639 [INFO][1] ipam.go 1966: Node doesn't exist, no need to release affinity cidr=100.121.216.192/26 host="i-094419c6b04d05ff5"
2024-03-26 12:30:15.639 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="ipip-tunnel-addr-i-0164c8a2b91bff621" ip="100.108.245.64" node="i-0164c8a2b91bff621"
... <hundreds of these log lines>
2024-03-26 13:00:56.646 [INFO][1] ipam.go 1141: Updating assignments in block cidr=100.113.216.192/26
2024-03-26 13:00:57.651 [INFO][1] ipam.go 1966: Node doesn't exist, no need to release affinity cidr=100.113.216.192/26 host="i-0a6c3e2349b0d7f05"
2024-03-26 13:00:57.657 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0516e099b8aba8cf2"
2024-03-26 13:01:00.704 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-021c7d7278087c5c7"
2024-03-26 13:01:03.913 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-016f1917477a45e6a"
... <hundreds of these log lines>

Here we can see that the IPAMBlock objects start to be deleted:

Screenshot 2024-03-26 at 14 00 55

Continuing logs:

2024-03-26 13:14:43.661 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-007338c1b7df83e16"
2024-03-26 13:14:46.863 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0a997d784ec34c0bc"
2024-03-26 13:14:47.393 [INFO][1] resources.go 378: Terminating main client watcher loop
2024-03-26 13:14:48.047 [INFO][1] resources.go 350: Main client watcher loop
2024-03-26 13:14:50.262 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-06bdb07413082e957"
2024-03-26 13:14:53.260 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0a82a8b615798d5e8"
... <hundreds of these log lines>
2024-03-26 13:17:55.930 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"
2024-03-26 13:17:56.645 [INFO][1] ipam_block_reader_writer.go 334: Block must be empty but is not empty, refusing to remove affinity. host="i-02efc6aee4c4c7faa" subnet="100.121.206.64/26"
2024-03-26 13:17:57.131 [ERROR][1] ipam.go 1061: Failed to release block affinities for node calicoNode="i-02efc6aee4c4c7faa" error=block '100.121.206.64/26' is not empty
2024-03-26 13:17:57.131 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.121.206.64/26' is not empty node="i-02efc6aee4c4c7faa"

Some errors are logged, but objects continue to be deleted:

2024-03-26 13:17:57.141 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0173aa5ad0c63a84e"
2024-03-26 13:17:59.729 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-017f881d8bb28295e"
... <hundreds of these log lines>
2024-03-26 13:46:13.737 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0148f7cb61db78322"
2024-03-26 13:46:16.126 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-07a55c671c9a76567"

The last sequential deletion request was recorded here:

Screenshot 2024-03-26 at 14 06 19

Continuing logs:

2024-03-26 13:46:18.328 [WARNING][1] ipam.go 279: error syncing IPAM data error=block '100.125.34.192/26' is not empty
2024-03-26 13:46:18.346 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-08cb6f82168bb6788) with error: nodes "i-08cb6f82168bb6788" not found
2024-03-26 13:46:18.351 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0c57bd5b453789ca3) with error: nodes "i-0c57bd5b453789ca3" not found
2024-03-26 13:46:18.353 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-09cb1f9fd9c72f7cb) with error: nodes "i-09cb1f9fd9c72f7cb" not found
2024-03-26 13:46:18.356 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0d26e08b69c3f41b2) with error: nodes "i-0d26e08b69c3f41b2" not found
2024-03-26 13:46:18.361 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0e1420fae53ad1e30) with error: nodes "i-0e1420fae53ad1e30" not found
2024-03-26 13:46:18.363 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0dd618ef0734d4a52) with error: nodes "i-0dd618ef0734d4a52" not found
2024-03-26 13:46:18.369 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-091cfd13bf1d33624) with error: nodes "i-091cfd13bf1d33624" not found
2024-03-26 13:46:18.371 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0e738bc107eafb135) with error: nodes "i-0e738bc107eafb135" not found
2024-03-26 13:46:18.374 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0fdc9bc4d98bad952) with error: nodes "i-0fdc9bc4d98bad952" not found
2024-03-26 13:46:18.376 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-009bec4ed495dd858) with error: nodes "i-009bec4ed495dd858" not found
2024-03-26 13:46:18.379 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-06d0a2ee77f46b170) with error: nodes "i-06d0a2ee77f46b170" not found
2024-03-26 13:46:18.382 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-03a2015d38b2adf3b) with error: nodes "i-03a2015d38b2adf3b" not found
2024-03-26 13:46:18.384 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0a58a4e5c6613ce2e) with error: nodes "i-0a58a4e5c6613ce2e" not found
2024-03-26 13:46:18.388 [INFO][1] watchercache.go 125: Watch error received from Upstream ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=too old resource version: 1366792777 (1367047938)
2024-03-26 13:46:18.389 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-080a9690b4507c039) with error: nodes "i-080a9690b4507c039" not found
2024-03-26 13:46:18.391 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-07e9f668a95defd33) with error: nodes "i-07e9f668a95defd33" not found
2024-03-26 13:46:18.393 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-033baed2e03fa5410) with error: nodes "i-033baed2e03fa5410" not found
2024-03-26 13:46:18.396 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-07fc138d341fa4de9) with error: nodes "i-07fc138d341fa4de9" not found
2024-03-26 13:46:18.398 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-04c5eb4a64c9457cf) with error: nodes "i-04c5eb4a64c9457cf" not found
2024-03-26 13:46:18.400 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0818dce1acfa489ed) with error: nodes "i-0818dce1acfa489ed" not found
2024-03-26 13:46:18.403 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-00d37334994c55315) with error: nodes "i-00d37334994c55315" not found
2024-03-26 13:46:18.405 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-04eab46868594997c) with error: nodes "i-04eab46868594997c" not found
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.63e51e07dc5989c9c48c8409b48be21b526745ab88ffd82efc91c72771565dc4" ip="100.110.126.3" node="i-0760668425c3c1fb3" pod="machinery/node-problem-detector-28t4p"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.e5e311b4beabda7f1d23a365a211045ba764ddd008d4510ff491f1665b2ce240" ip="100.110.126.4" node="i-0760668425c3c1fb3" pod="kube-system/ebs-csi-node-kr4x7"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.39720b2e9797717118e518514973dc84d0a95adabd9dc2e7457ab644d9371e0f" ip="100.110.126.6" node="i-0760668425c3c1fb3" pod="machinery/fluentd-mzlhw"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.93815152461a33f6743e00ca9fe1cebece67c56e69f6ccf16fe18bc00474e064" ip="100.110.126.7" node="i-0760668425c3c1fb3" pod="faster-push/push-data-replicator-9dcd889ff-2wsxz"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.102834cc880e307d844381dc1f92bf0c2dde7fc69cd9bb0ee79f33121704e8df" ip="100.110.126.12" node="i-0760668425c3c1fb3" pod="faster-analytics/analytics-data-replicator-b6dd6bf46-4q2th"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.7fb0141f4620168066d170fc0c618e6630d328ad49498d3858cdba5ca3c08471" ip="100.110.126.2" node="i-0760668425c3c1fb3" pod="machinery/consul-consul-gwwcp"
2024-03-26 13:46:18.405 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-0760668425c3c1fb3" ip="100.110.126.0" node="i-0760668425c3c1fb3"
2024-03-26 13:46:18.407 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0120820b172d4611c) with error: nodes "i-0120820b172d4611c" not found
... <hundreds of these log lines>
2024-03-26 13:46:18.747 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0098448138d81a356) with error: nodes "i-0098448138d81a356" not found
2024-03-26 13:46:18.888 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-03-26 13:46:18.947 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0ea964a21a4bcdb7d) with error: nodes "i-0ea964a21a4bcdb7d" not found
... <sometimes we can see this resync required>
2024-03-26 13:49:14.547 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-02cd2723dd03173c5) with error: nodes "i-02cd2723dd03173c5" not found
2024-03-26 13:49:14.749 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-0a5040f3b689e13af) with error: nodes "i-0a5040f3b689e13af" not found
2024-03-26 13:49:14.749 [INFO][1] ipam.go 1018: Leaked IP has been resurrected after querying latest state handle="ipip-tunnel-addr-i-0d09b6a916eb08480" ip="100.100.98.64" node="i-0d09b6a916eb08480"
2024-03-26 13:49:14.751 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="k8s-pod-network.e563789abd0adcf5f05173a18f8d4ece2ea753c47e0f328a3ca298b58c2a4b0c" ip="100.124.169.131" node="i-0b324709ab5d81add" pod="machinery/consul-consul-ntrzd"
2024-03-26 13:49:14.761 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="k8s-pod-network.404c07e1524359de0d98cfea71a10f475107b0399f52dab393cfd0c67ad6bce9" ip="100.124.107.146" node="i-034aa99396e8f39bd" pod="driver-profile/schema-driver-api-d8bb5c9d-tfg8c"
2024-03-26 13:49:14.770 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="k8s-pod-network.13629863944ddfdad81495bfd513bc4f455ab4e73ae97926335c174953024bb0" ip="100.104.241.193" node="i-049cf20d32c70118e" pod="machinery/consul-consul-kzgb9" 
... <dozens of these log lines>
2024-03-26 13:51:21.356 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="k8s-pod-network.335522c09993b15d068771451af3a47e02407505ee8ea0e271c1bead91810c4d" ip="100.118.247.205" node="i-0eb8f210e734e5310" pod="user-profile/user-profile-realtime-entity-materializer-546c699769-4b4rp"
2024-03-26 13:51:21.961 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-08cb6f82168bb6788"
2024-03-26 13:51:22.441 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0c57bd5b453789ca3"
... <and again hundreds of the Cleaning up logs>
2024-03-26 13:51:30.638 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0760668425c3c1fb3"
2024-03-26 13:51:31.155 [INFO][1] ipam_block_reader_writer.go 334: Block must be empty but is not empty, refusing to remove affinity. host="i-0760668425c3c1fb3" subnet="100.110.126.0/26"
2024-03-26 13:51:31.437 [ERROR][1] ipam.go 1061: Failed to release block affinities for node calicoNode="i-0760668425c3c1fb3" error=block '100.110.126.0/26' is not empty
2024-03-26 13:51:31.437 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.110.126.0/26' is not empty node="i-0760668425c3c1fb3"
2024-03-26 13:51:31.447 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-0120820b172d4611c"
... <but this time, with many more errors>

At this point we can observe that, although the log reports hundreds of "Cleaning up" lines, we do not see the same number of blocks being deleted via API requests (sometimes no requests appear at all). For example, in 10 minutes (between 13:51:00 and 14:01:00) we counted 883 "Cleaning up" log lines (excluding the error ones), but the IPAMBlock object metric doesn't decrease and the API metrics report only a few delete requests.
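For reference, the count comes from something roughly equivalent to this (a sketch; the deployment name and namespace are assumptions):

kubectl -n kube-system logs deployment/calico-kube-controllers --since=10m \
  | grep -c "Cleaning up IPAM resources for deleted node"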

After some time (and some watch error logs), the entire process starts again: node not found + confirmed IP leak -> garbage collecting -> cleaning up. But the objects themselves are no longer deleted:

Screenshot 2024-03-26 at 14 21 37

So we need to delete the blocks manually and restart kube-controllers.
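For anyone reproducing the manual cleanup, something along these lines lists the orphaned blocks, i.e. IPAMBlocks whose affinity points at a node that no longer exists (a sketch; assumes jq is available and the Kubernetes datastore):

# Current node names
kubectl get nodes -o name | sed 's#^node/##' > /tmp/nodes.txt

# IPAMBlocks whose spec.affinity host is not in the node list
kubectl get ipamblocks.crd.projectcalico.org -o json \
  | jq -r '.items[] | select(.spec.affinity != null) | .metadata.name + " " + (.spec.affinity | ltrimstr("host:"))' \
  | awk 'NR==FNR {nodes[$1]; next} !($2 in nodes) {print $1}' /tmp/nodes.txt -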

Below is one of the blocks that kube-controllers cannot remove because it is not empty:

apiVersion: crd.projectcalico.org/v1
kind: IPAMBlock
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":null}'
  creationTimestamp: "2024-03-26T12:07:15Z"
  generation: 24
  name: 100-121-206-64-26
  resourceVersion: "1366759811"
  uid: 263e7bad-c8e9-4b99-be56-5f842a1d6482
spec:
  affinity: host:i-02efc6aee4c4c7faa
  allocations:
  - 0
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  attributes:
  - handle_id: ipip-tunnel-addr-i-02efc6aee4c4c7faa
    secondary:
      node: i-02efc6aee4c4c7faa
      type: ipipTunnelAddress
  cidr: 100.121.206.64/26
  deleted: false
  sequenceNumber: 1711454835691545660
  sequenceNumberForAllocation:
    "0": 1711454835691545637
    "1": 1711454835691545638
    "2": 1711454835691545639
    "3": 1711454835691545640
    "4": 1711454835691545641
    "5": 1711454835691545642
    "6": 1711454835691545643
    "7": 1711454835691545644
    "8": 1711454835691545647
    "9": 1711454835691545653
    "10": 1711454835691545654
    "11": 1711454835691545655
  strictAffinity: false
  unallocated:
  - 12
  - 13
  - 14
  - 15
  - 16
  - 17
  - 18
  - 19
  - 20
  - 21
  - 22
  - 23
  - 24
  - 25
  - 26
  - 27
  - 28
  - 29
  - 30
  - 31
  - 32
  - 33
  - 34
  - 35
  - 36
  - 37
  - 38
  - 39
  - 40
  - 41
  - 42
  - 43
  - 44
  - 45
  - 46
  - 47
  - 48
  - 49
  - 50
  - 51
  - 52
  - 53
  - 54
  - 55
  - 56
  - 57
  - 58
  - 59
  - 60
  - 61
  - 62
  - 63
  - 3
  - 6
  - 8
  - 1
  - 5
  - 4
  - 2
  - 7
  - 9
  - 10
  - 11

Node was removed:

k get node i-02efc6aee4c4c7faa -o yaml
Error from server (NotFound): nodes "i-02efc6aee4c4c7faa" not found

And the respective IPAMHandle:

apiVersion: crd.projectcalico.org/v1
kind: IPAMHandle
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":null}'
  creationTimestamp: "2024-03-26T12:07:15Z"
  generation: 1
  name: ipip-tunnel-addr-i-02efc6aee4c4c7faa
  resourceVersion: "1366677307"
  uid: 8525f5bc-6867-45ca-8f9d-2c74004cf132
spec:
  block:
    100.121.206.64/26: 1
  deleted: false
  handleID: ipip-tunnel-addr-i-02efc6aee4c4c7faa
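As far as we can tell, it is this leftover tunnel-address allocation that keeps the block non-empty. One way to release such an address by hand appears to be calicoctl's IPAM tooling (a sketch; run it only for nodes that are confirmed gone, since the tool cannot know whether the address is still wanted):

calicoctl ipam release --ip=100.121.206.64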
caseydavenport commented 5 months ago

Thank you for the detailed information. I'll take a look at this for potential causes and report back what I find.

caseydavenport commented 5 months ago

@cmotta2016 I was looking through related issues in this area and came across this one: https://github.com/projectcalico/calico/issues/7839

Which reminded me of this pull request: https://github.com/projectcalico/calico/pull/8154

The summary of the issue in that case is that the garbage collection logic was being rate limited to extreme lengths when under heavy load (particularly if there are errors causing retry attempts, which is what I see in the logs you provided).

I have a suspicion that you are encountering the same issue here, and that restarting the pod clears the client-side rate limit cache, allowing things to progress until a sufficient number of errors are hit, at which point everything grinds to a halt again.

That said, you said you tried v3.27.2 without success, which sort of puts a damper on this theory.

  1. I assume the logs you provided are from Calico v3.25?
  2. Just to double check - is the installation of Calico managed by tigera/operator or directly via k8s manifests? If the former, it's possible the operator reverted your change to test out v3.27.
caseydavenport commented 5 months ago

I suppose it is probably not exactly that issue, as I can see that we are iterating through the nodes at a rate of about one node per minute, while the linked PR suggests times of multiple minutes. So perhaps a red herring there.

caseydavenport commented 5 months ago
2024-03-26 12:28:06.840 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-02efc6aee4c4c7faa" ip="100.121.206.64" node="i-02efc6aee4c4c7faa"
2024-03-26 12:31:16.450 [INFO][1] ipam.go 1018: Leaked IP has been resurrected after querying latest state handle="ipip-tunnel-addr-i-02efc6aee4c4c7faa" ip="100.121.206.64" node="i-02efc6aee4c4c7faa"
2024-03-26 13:17:55.930 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"
2024-03-26 13:17:56.645 [INFO][1] ipam_block_reader_writer.go 334: Block must be empty but is not empty, refusing to remove affinity. host="i-02efc6aee4c4c7faa" subnet="100.121.206.64/26"
2024-03-26 13:17:57.131 [ERROR][1] ipam.go 1061: Failed to release block affinities for node calicoNode="i-02efc6aee4c4c7faa" error=block '100.121.206.64/26' is not empty
2024-03-26 13:17:57.131 [WARNING][1] ipam.go 996: Error cleaning up node error=block '100.121.206.64/26' is not empty node="i-02efc6aee4c4c7faa"
2024-03-26 13:46:18.328 [WARNING][1] ipam.go 279: error syncing IPAM data error=block '100.125.34.192/26' is not empty
2024-03-26 13:49:10.947 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-02efc6aee4c4c7faa) with error: nodes "i-02efc6aee4c4c7faa" not found
2024-03-26 13:49:10.947 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="ipip-tunnel-addr-i-02efc6aee4c4c7faa" ip="100.121.206.64" node="i-02efc6aee4c4c7faa"
2024-03-26 13:50:03.154 [INFO][1] ipam.go 1030: Garbage collecting leaked IP address handle="ipip-tunnel-addr-i-02efc6aee4c4c7faa" ip="100.121.206.64" node="i-02efc6aee4c4c7faa"
2024-03-26 13:50:04.761 [INFO][1] ipam.go 1966: Node doesn't exist, no need to release affinity cidr=100.121.206.64/26 host="i-02efc6aee4c4c7faa"
2024-03-26 14:01:39.567 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"
2024-03-26 14:01:57.760 [WARNING][1] ipam.go 279: error syncing IPAM data error=block '100.115.59.128/26' is not empty
2024-03-26 14:02:18.577 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-02efc6aee4c4c7faa) with error: nodes "i-02efc6aee4c4c7faa" not found
2024-03-26 14:11:06.604 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"
2024-03-26 14:25:33.603 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-02efc6aee4c4c7faa) with error: nodes "i-02efc6aee4c4c7faa" not found
2024-03-26 14:37:52.238 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"
2024-03-26 14:43:08.030 [WARNING][1] ipam.go 279: error syncing IPAM data error=block '100.110.168.192/26' is not empty
2024-03-26 14:45:06.843 [INFO][1] ipam.go 1129: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(i-02efc6aee4c4c7faa) with error: nodes "i-02efc6aee4c4c7faa" not found
2024-03-26 14:58:08.675 [INFO][1] ipam.go 993: Cleaning up IPAM resources for deleted node node="i-02efc6aee4c4c7faa"

From this filtered set of logs, I see:

This is puzzling for a few reasons:

I do wonder if this indicates some issue with the internal state of the controller.

I almost hesitate to ask, but a log with "Debug" level logging might provide the clues that I need. It will most likely be a LOT of logs, but it's worth a shot to see if it's possible to grab it. In the meantime I'll keep looking into potential ways the controller might get confused.
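In case it helps, debug logging for kube-controllers can typically be enabled in one of two ways (a sketch; the kube-system namespace is an assumption that depends on the installation):

# Via the KubeControllersConfiguration resource
calicoctl patch kubecontrollersconfiguration default --patch '{"spec":{"logSeverityScreen":"Debug"}}'

# Or via the environment variable on the deployment
kubectl -n kube-system set env deployment/calico-kube-controllers LOG_LEVEL=debug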

cmotta2016 commented 5 months ago

Nice job @caseydavenport, and thanks again.

About your questions:

I assume the logs you provided are from Calico v3.25?

Yes, these logs are from v3.25

is the installation of Calico managed by tigera/operator or directly via k8s manifests?

The installation is managed by kOps (basically this file). Support for Calico v3.27 was introduced in kOps v1.29, but it's still in beta. We tried to upgrade our kOps version to test Calico, but it broke the cluster (we opened an issue about it).

Because of this limitation, I just upgraded the calico-kube-controllers image in another (sandbox) cluster and tried to reproduce the scenario. We saw the same behavior.

Now, we're trying to do:

  1. Bump kOps to v1.29 beta (if we can solve the node bring-up problem)
  2. Change the log level of Calico v3.25 to DEBUG

I will post updates as we progress.

atmask commented 3 months ago

Following. We're also experiencing a similar issue in a cluster with a lot of up- and down-scaling. Restarting the controller allows blocks to get freed again.

atmask commented 2 months ago

Freeing leaked IPs, scaling down, and restarting the controller resolved the issue for me. Running Calico v3.26. I will upgrade to v3.27.
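One way to free the leaked IPs without hand-editing resources is calicoctl's IPAM tooling (a sketch; assumes a calicoctl recent enough to have the ipam check/release commands, and the report should be taken while the cluster is quiet so the flagged allocations really are stale):

calicoctl ipam check --show-problem-ips
calicoctl ipam check -o report.json
calicoctl ipam release --from-report=report.json
kubectl -n kube-system rollout restart deployment/calico-kube-controllers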