projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Memory leak in calico-kube-controllers - v3.20 #5218

Closed. kppullin closed this issue 2 years ago

kppullin commented 3 years ago

Expected Behavior

Reasonable memory usage with no apparent memory leaks.

Current Behavior

calico-kube-controllers v3.20 leaks memory (e.g., 9GB on a 16GB host) running on EKS 1.19. Eventually this leads to pod evictions and alerts firing on the hosting k8s node until the controller pod is recycled.

After dumping all the strings in the process's memory space, we see what appear to be pod specs for all current and formerly running pods held in memory. One reason this may hit us especially hard is that we use Airflow to run and schedule many frequent jobs in our cluster.

For example, from the calico-kube-controllers memory dump we obtained, we see 304052 copies of the following pod spec snippet:

304052 {"f:metadata":{"f:labels":{".":{},"f:airflow-worker":{},"f:airflow_version":{},"f:app.kubernetes.io/name":{},"f:dag_id":{},"f:execution_date":{},"f:kubernetes_executor":{},"f:task_id":{},"f:try_number":{}}},"f:spec":{"f:affinity":{".":{},"f:nodeAffinity":{".":{},"f:requiredDuringSchedulingIgnoredDuringExecution":{".":{},"f:nodeSelectorTerms":{}}}},"f:containers":{"k:{\"name\":\"base\"}":{".":{},"f:args":{},"f:command":{},"f:env":{".":{},"k:{\"name\":\"AIRFLOW_HOME\"}"....

Note that while Airflow managed pods appear most frequently, we see stale pod specs in the memory dump for non-airflow pods as well.

Steps to Reproduce (for bugs)

Unknown. Per the above, possibly by creating and deleting pods frequently, particularly via Airflow.

Your Environment

Context

If helpful, here's a recent dump of the last 100 lines of the controller logs, with minor redactions:

2021-09-07 16:47:12.723 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 16:47:13.047 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:47:13.074 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"
2021-09-07 16:47:13.083 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-110.us-west-2.compute.internal" error=block '100.127.36.192/26' is not empty
2021-09-07 16:47:13.083 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.127.36.192/26' is not empty node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:47:13.723 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:47:13.742 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-224.us-west-2.compute.internal" subnet="100.87.50.128/26"
2021-09-07 16:47:13.750 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-224.us-west-2.compute.internal" error=block '100.87.50.128/26' is not empty
2021-09-07 16:47:13.750 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.87.50.128/26' is not empty node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:47:15.030 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:47:15.051 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-36-15.us-west-2.compute.internal" subnet="100.80.22.128/26"
2021-09-07 16:47:15.063 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-36-15.us-west-2.compute.internal" error=block '100.80.22.128/26' is not empty
2021-09-07 16:47:15.063 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.80.22.128/26' is not empty node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:47:15.063 [WARNING][1] ipam.go 252: error syncing IPAM data error=block '100.80.22.128/26' is not empty
2021-09-07 16:47:15.103 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-224.us-west-2.compute.internal) with error: nodes "ip-x-x-32-224.us-west-2.compute.internal" not found
2021-09-07 16:47:15.123 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-36-15.us-west-2.compute.internal) with error: nodes "ip-x-x-36-15.us-west-2.compute.internal" not found
2021-09-07 16:47:15.472 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-110.us-west-2.compute.internal) with error: nodes "ip-x-x-32-110.us-west-2.compute.internal" not found
2021-09-07 16:47:16.869 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.ee250e358a497d80fcf7d5b712be367f9b21136d548b21438296178712017ddd" ip="100.80.22.186" node="ip-x-x-36-15.us-west-2.compute.internal" pod="data/xxx-50ccb6ff77e84009972b09c060d70309"
2021-09-07 16:47:17.069 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 16:47:17.267 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.42fa01cdc2d9dd1aceb1bf810d29b26b2427a373a1935a6804ea4ff202208725" ip="100.87.50.140" node="ip-x-x-32-224.us-west-2.compute.internal" pod="data/xxx-585b1906fcf747dd97502bd0c1c8d46b"
2021-09-07 16:47:19.837 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:47:19.853 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-224.us-west-2.compute.internal" subnet="100.87.50.128/26"
2021-09-07 16:47:19.862 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-224.us-west-2.compute.internal" error=block '100.87.50.128/26' is not empty
2021-09-07 16:47:19.862 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.87.50.128/26' is not empty node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:47:24.982 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:47:25.000 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-36-15.us-west-2.compute.internal" subnet="100.80.22.128/26"
2021-09-07 16:47:25.010 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-36-15.us-west-2.compute.internal" error=block '100.80.22.128/26' is not empty
2021-09-07 16:47:25.011 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.80.22.128/26' is not empty node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:47:35.251 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:47:35.268 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"
2021-09-07 16:47:35.277 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-110.us-west-2.compute.internal" error=block '100.127.36.192/26' is not empty
2021-09-07 16:47:35.277 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.127.36.192/26' is not empty node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:47:35.277 [WARNING][1] ipam.go 252: error syncing IPAM data error=block '100.127.36.192/26' is not empty
2021-09-07 16:47:35.287 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-110.us-west-2.compute.internal) with error: nodes "ip-x-x-32-110.us-west-2.compute.internal" not found
2021-09-07 16:47:35.302 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-36-15.us-west-2.compute.internal) with error: nodes "ip-x-x-36-15.us-west-2.compute.internal" not found
2021-09-07 16:47:35.319 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-224.us-west-2.compute.internal) with error: nodes "ip-x-x-32-224.us-west-2.compute.internal" not found
2021-09-07 16:47:37.086 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 16:47:37.281 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.42fa01cdc2d9dd1aceb1bf810d29b26b2427a373a1935a6804ea4ff202208725" ip="100.87.50.140" node="ip-x-x-32-224.us-west-2.compute.internal" pod="data/xxx-585b1906fcf747dd97502bd0c1c8d46b"
2021-09-07 16:47:37.483 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.ee250e358a497d80fcf7d5b712be367f9b21136d548b21438296178712017ddd" ip="100.80.22.186" node="ip-x-x-36-15.us-west-2.compute.internal" pod="data/xxx-50ccb6ff77e84009972b09c060d70309"
2021-09-07 16:47:57.969 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:47:57.985 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"
2021-09-07 16:47:57.995 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-110.us-west-2.compute.internal" error=block '100.127.36.192/26' is not empty
2021-09-07 16:47:57.995 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.127.36.192/26' is not empty node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:48:38.955 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:48:38.973 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-36-15.us-west-2.compute.internal" subnet="100.80.22.128/26"
2021-09-07 16:48:38.989 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-36-15.us-west-2.compute.internal" error=block '100.80.22.128/26' is not empty
2021-09-07 16:48:38.989 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.80.22.128/26' is not empty node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 16:50:00.909 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:50:00.928 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-224.us-west-2.compute.internal" subnet="100.87.50.128/26"
2021-09-07 16:50:00.938 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-224.us-west-2.compute.internal" error=block '100.87.50.128/26' is not empty
2021-09-07 16:50:00.938 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.87.50.128/26' is not empty node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:50:00.938 [WARNING][1] ipam.go 252: error syncing IPAM data error=block '100.87.50.128/26' is not empty
2021-09-07 16:50:00.942 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-110.us-west-2.compute.internal) with error: nodes "ip-x-x-32-110.us-west-2.compute.internal" not found
2021-09-07 16:50:01.353 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-224.us-west-2.compute.internal) with error: nodes "ip-x-x-32-224.us-west-2.compute.internal" not found
2021-09-07 16:50:01.757 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-36-15.us-west-2.compute.internal) with error: nodes "ip-x-x-36-15.us-west-2.compute.internal" not found
2021-09-07 16:50:02.749 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 16:50:02.947 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.42fa01cdc2d9dd1aceb1bf810d29b26b2427a373a1935a6804ea4ff202208725" ip="100.87.50.140" node="ip-x-x-32-224.us-west-2.compute.internal" pod="data/xxx-585b1906fcf747dd97502bd0c1c8d46b"
2021-09-07 16:50:03.146 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.ee250e358a497d80fcf7d5b712be367f9b21136d548b21438296178712017ddd" ip="100.80.22.186" node="ip-x-x-36-15.us-west-2.compute.internal" pod="data/xxx-50ccb6ff77e84009972b09c060d70309"
2021-09-07 16:52:46.994 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:52:47.013 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"
2021-09-07 16:52:47.022 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-110.us-west-2.compute.internal" error=block '100.127.36.192/26' is not empty
2021-09-07 16:52:47.022 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.127.36.192/26' is not empty node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 16:58:14.702 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 16:58:14.720 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-224.us-west-2.compute.internal" subnet="100.87.50.128/26"
2021-09-07 16:58:14.727 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-224.us-west-2.compute.internal" error=block '100.87.50.128/26' is not empty
2021-09-07 16:58:14.727 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.87.50.128/26' is not empty node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 17:09:10.087 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 17:09:10.106 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-36-15.us-west-2.compute.internal" subnet="100.80.22.128/26"
2021-09-07 17:09:10.114 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-36-15.us-west-2.compute.internal" error=block '100.80.22.128/26' is not empty
2021-09-07 17:09:10.114 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.80.22.128/26' is not empty node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 17:09:10.114 [WARNING][1] ipam.go 252: error syncing IPAM data error=block '100.80.22.128/26' is not empty
2021-09-07 17:09:10.441 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-110.us-west-2.compute.internal) with error: nodes "ip-x-x-32-110.us-west-2.compute.internal" not found
2021-09-07 17:09:10.524 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-224.us-west-2.compute.internal) with error: nodes "ip-x-x-32-224.us-west-2.compute.internal" not found
2021-09-07 17:09:11.125 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-36-15.us-west-2.compute.internal) with error: nodes "ip-x-x-36-15.us-west-2.compute.internal" not found
2021-09-07 17:09:11.921 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 17:09:12.126 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.42fa01cdc2d9dd1aceb1bf810d29b26b2427a373a1935a6804ea4ff202208725" ip="100.87.50.140" node="ip-x-x-32-224.us-west-2.compute.internal" pod="data/xxx-585b1906fcf747dd97502bd0c1c8d46b"
2021-09-07 17:09:12.319 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.ee250e358a497d80fcf7d5b712be367f9b21136d548b21438296178712017ddd" ip="100.80.22.186" node="ip-x-x-36-15.us-west-2.compute.internal" pod="data/xxx-50ccb6ff77e84009972b09c060d70309"
2021-09-07 17:25:52.324 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 17:25:52.342 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"
2021-09-07 17:25:52.351 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-110.us-west-2.compute.internal" error=block '100.127.36.192/26' is not empty
2021-09-07 17:25:52.351 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.127.36.192/26' is not empty node="ip-x-x-32-110.us-west-2.compute.internal"
2021-09-07 17:29:04.560 [INFO][1] resources.go 377: Terminating main client watcher loop
2021-09-07 17:29:04.574 [INFO][1] resources.go 349: Main client watcher loop
2021-09-07 17:34:37.742 [INFO][1] watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/ipam/v2/assignment/"
2021-09-07 17:42:32.351 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 17:42:32.372 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-224.us-west-2.compute.internal" subnet="100.87.50.128/26"
2021-09-07 17:42:32.380 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-32-224.us-west-2.compute.internal" error=block '100.87.50.128/26' is not empty
2021-09-07 17:42:32.380 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.87.50.128/26' is not empty node="ip-x-x-32-224.us-west-2.compute.internal"
2021-09-07 17:45:39.019 [INFO][1] watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-09-07 17:59:12.380 [INFO][1] ipam.go 871: Cleaning up IPAM resources for deleted node node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 17:59:12.397 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-36-15.us-west-2.compute.internal" subnet="100.80.22.128/26"
2021-09-07 17:59:12.407 [ERROR][1] ipam.go 930: Failed to release block affinities for node calicoNode="ip-x-x-36-15.us-west-2.compute.internal" error=block '100.80.22.128/26' is not empty
2021-09-07 17:59:12.407 [WARNING][1] ipam.go 874: Error cleaning up node error=block '100.80.22.128/26' is not empty node="ip-x-x-36-15.us-west-2.compute.internal"
2021-09-07 17:59:12.407 [WARNING][1] ipam.go 242: Periodic IPAM sync failed error=block '100.80.22.128/26' is not empty
2021-09-07 17:59:12.418 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-36-15.us-west-2.compute.internal) with error: nodes "ip-x-x-36-15.us-west-2.compute.internal" not found
2021-09-07 17:59:12.446 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-224.us-west-2.compute.internal) with error: nodes "ip-x-x-32-224.us-west-2.compute.internal" not found
2021-09-07 17:59:13.017 [INFO][1] ipam.go 966: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-x-x-32-110.us-west-2.compute.internal) with error: nodes "ip-x-x-32-110.us-west-2.compute.internal" not found
2021-09-07 17:59:14.211 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.ee250e358a497d80fcf7d5b712be367f9b21136d548b21438296178712017ddd" ip="100.80.22.186" node="ip-x-x-36-15.us-west-2.compute.internal" pod="data/xxx-50ccb6ff77e84009972b09c060d70309"
2021-09-07 17:59:14.413 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"
2021-09-07 17:59:14.614 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.42fa01cdc2d9dd1aceb1bf810d29b26b2427a373a1935a6804ea4ff202208725" ip="100.87.50.140" node="ip-x-x-32-224.us-west-2.compute.internal" pod="data/xxx-585b1906fcf747dd97502bd0c1c8d46b"
caseydavenport commented 3 years ago

@kppullin thanks for the bug report - just checking, did you see this behavior on Calico v3.19, or is this newly introduced in v3.20, in your experience?

kppullin commented 3 years ago

Oh, good question... we've just recently converted our "legacy" calico deployment to use the operator, so we've only been using v3.20. Let me try running v3.19 in one of our non-production clusters and see what happens.

kppullin commented 3 years ago

After switching versions to v3.19.1 I'm no longer seeing the memory leak. I'll try switching to this version in our other clusters and see if the leak does indeed stop.

caseydavenport commented 3 years ago

@kppullin by chance would you be able to share the full memory dump that you captured? Could do it privately via DM if you'd like. I am hoping it might help me pinpoint where the leak is occurring.

If not, I'll try to reproduce it on my own.

caseydavenport commented 3 years ago

Looking at your logs, it does seem that the controller is having a hard time. It thinks that a handful of nodes should be cleaned up, but is then surprised to find they are still in use.

Do these nodes actually exist in your cluster? e.g., do those names show up in kubectl get nodes?

kppullin commented 3 years ago

re: missing nodes - Those nodes are not currently active and I'm unable to find reference to them in our logs, so if they did exist at some point in time, it's been a long time since they've been seen.

re: memory dump - The dump I have was captured using gcore. Will this work for you or do you need something more platform specific? (I've tried using various viewcore builds with no success, can't seem to get much useful data from delve in this case, and haven't run across a method to make it compatible w/ pprof tooling).

If you need some other form I can reproduce in one of our clusters, but may need some hints on grabbing it in a useful form (I see some other projects in the org have pprof support but didn't run across anything in kube-controllers).

caseydavenport commented 3 years ago

Those nodes are not currently active and I'm unable to find reference to them in our logs, so if they did exist at some point in time, it's been a long time since they've been seen.

Hm, interesting. So it seems that the controller is identifying that the nodes are gone and resources need to be released, but is then failing to do so because it sees one or more IPs on that node that it thinks are still valid.

The dump I have was captured using gcore. Will this work for you or do you need something more platform specific? (I've tried using various viewcore builds with no success, can't seem to get much useful data from delve in this case, and haven't run across a method to make it compatible w/ pprof tooling).

Hmm, I forgot that we haven't yet added pprof to kube-controllers. I am not very familiar with gcore; from a skim, it sounds like it might be hard to get much useful information from it here.

Pulling on another thread for now...

2021-09-07 16:47:13.074 [INFO][1] ipam_block_reader_writer.go 300: Block must be empty but is not empty, refusing to remove affinity. host="ip-x-x-32-110.us-west-2.compute.internal" subnet="100.127.36.192/26"

Could you use kubectl to get the IPAM block referenced here? It should be something like this:

kubectl get ipamblocks -o yaml 100-127-36-192-26

That is one of the blocks the controller thinks belongs to a nonexistent node, but that still has valid allocations in it.

kppullin commented 3 years ago

The full output for the requested ipamblock is below.

Also of note: in this particular cluster, there are 16 active nodes and 19 ipamblocks. The "unknown" hosts have ages of 186d, 311d, and 131d, whereas all other nodes are 41d or less (we rotated all nodes during a recent EKS upgrade).

❯ kubectl get ipamblocks -o yaml 100-127-36-192-26
apiVersion: crd.projectcalico.org/v1
kind: IPAMBlock
metadata:
  annotations:
    projectcalico.org/metadata: '{"creationTimestamp":null}'
  creationTimestamp: "2021-03-18T20:31:28Z"
  generation: 797392
  name: 100-127-36-192-26
  resourceVersion: "325023079"
  selfLink: /apis/crd.projectcalico.org/v1/ipamblocks/100-127-36-192-26
  uid: 0248a420-f6b7-496a-a8cb-1944c18db930
spec:
  affinity: host:ip-x-x-32-110.us-west-2.compute.internal
  allocations:
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - null
  - 0
  - null
  - null
  attributes:
  - handle_id: k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8
    secondary:
      namespace: data
      node: ip-x-x-32-110.us-west-2.compute.internal
      pod: <airflow-short-lived-task-pod-name>-3d283c8214c544d4bf21ea09f6b4aad3
  cidr: 100.127.36.192/26
  deleted: false
  strictAffinity: false
  unallocated:
  - 32
  - 33
  - 41
  - 5
  - 54
  - 28
  - 59
  - 40
  - 6
  - 50
  - 7
  - 26
  - 60
  - 16
  - 47
  - 36
  - 46
  - 19
  - 56
  - 48
  - 43
  - 18
  - 15
  - 53
  - 34
  - 30
  - 29
  - 22
  - 2
  - 8
  - 21
  - 0
  - 1
  - 3
  - 4
  - 9
  - 10
  - 11
  - 12
  - 13
  - 14
  - 17
  - 20
  - 23
  - 24
  - 25
  - 27
  - 31
  - 35
  - 37
  - 38
  - 39
  - 42
  - 44
  - 45
  - 49
  - 51
  - 52
  - 55
  - 57
  - 58
  - 62
  - 63
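
As a quick cross-check, here is a small sketch (it assumes the allocations array is indexed by each address's ordinal offset within the block CIDR, which is how the spec above reads): the single non-null entry sits at index 61, which maps to 100.127.36.253, the same address the controller keeps trying to garbage collect in the logs.

package main

import (
	"fmt"
	"net"
)

// ordinalToIP returns the address at the given offset within a block CIDR,
// assuming allocations[i] refers to the i-th address of the block.
func ordinalToIP(cidr string, ordinal int) net.IP {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	ip := ipnet.IP.To4()
	base := uint32(ip[0])<<24 | uint32(ip[1])<<16 | uint32(ip[2])<<8 | uint32(ip[3])
	base += uint32(ordinal)
	return net.IPv4(byte(base>>24), byte(base>>16), byte(base>>8), byte(base))
}

func main() {
	// Index 61 is the only non-null entry in the allocations list above.
	fmt.Println(ordinalToIP("100.127.36.192/26", 61)) // prints 100.127.36.253
}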
caseydavenport commented 3 years ago

2021-09-07 17:59:14.413 [INFO][1] ipam.go 902: Garbage collecting leaked IP address handle="k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8" ip="100.127.36.253" node="ip-x-x-32-110.us-west-2.compute.internal" pod="data/xxx-3d283c8214c544d4bf21ea09f6b4aad3"

Interesting. In the logs you posted, I see the controller attempting to release what it believes is a leaked IP address (and from what you say, is indeed a leaked IP). However, the block still has that IP address allocated.

The logs don't indicate any error releasing the IP. That suggests to me that something bad happened to that allocation a long time ago that resulted in an inconsistent state.

If that were the case, I would expect the corrupted state to result in this command returning 0 results:

kubectl get ipamhandle k8s-pod-network.fc90c4e5f824e2cf08399b4cdc80d927c21837b84b55abef918708b86eeb05c8

So, I would guess you're hitting this branch but the log isn't showing up because it is "Debug" level: https://github.com/projectcalico/kube-controllers/blob/c3ec2f7d6fb7554ad9c6e239b85e36d0eea3607e/pkg/controllers/node/ipam.go#L906-L907

Given those nodes are long gone, and the IP allocations within them appear to be corrupt at this point, the best course of action is likely to delete the blocks altogether (which is what the controller is trying to do, but it is subsequently confused by the corrupt state).

e.g.

kubectl delete ipamblock 100-127-36-192-26

This should get rid of the cyclic error / warning logs in kube-controllers after you do it for each of the three leaked blocks.

I would be interested to know if that hides the memory growth - if it does, then it suggests that some memory is not being properly let go as a result of hitting that branch.

kppullin commented 3 years ago

I went ahead and switched back to v3.20.0 in a couple of our clusters, one with extra unknown/stale ipamblocks and one without, both running similar workloads (frequent, short-lived pods). The cluster with the stale ipamblocks did leak memory, while the other did not.

I then deleted the unknown, extra ipamblock. The process continued to leak memory until it was restarted; after restarting, the process is no longer leaking memory.

salanki commented 2 years ago

We are still seeing similar memory leak issues in 3.21.4.

caseydavenport commented 2 years ago

I think this is the same as this issue: https://github.com/projectcalico/calico/issues/5037#issuecomment-1000436879

I've been unable to reproduce myself, but per the linked comment I have produced a debug image with pprof enabled (this is also now available in v3.22.0) which I hope will give some insight into where the memory is being leaked.

If anyone is hitting this and can grab one of those profiles, that would be a help, since I cannot seem to reproduce this issue myself locally.
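
For reference, exposing pprof in a Go service is typically just the standard net/http/pprof package plus an HTTP listener; the sketch below shows the general pattern only, and the listen address, port, and any flag the kube-controllers debug image uses are assumptions, not taken from this thread. Once such an endpoint is up, a heap profile can be fetched with go tool pprof http://<pod-ip>:6060/debug/pprof/heap.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Hypothetical debug listener; the real image may use a different address or gate it behind a flag.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the controller's normal work would run here ...
	select {}
}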

shcherbak commented 2 years ago

etcd 3.5.2, kubernetes 1.22.6, debian buster, kernel 5.10, calico 3.21.4 the same memory leak issue

shcherbak commented 2 years ago

etcd 3.5.2, kubernetes 1.22.6, debian buster, kernel 5.10, calico 3.22.1 the same memory leak issue

shcherbak commented 2 years ago

We have maxPods: 100 in the kubelet config and the default Calico IPAM block size of /26 (64 addresses). Today I reconfigured the IPAM block size to /24 (so a block holds more addresses than maxPods) and calico-kube-controllers stopped leaking. I will observe the cluster behavior and give an update tomorrow, after several re-deploys of the application.

The update: I can confirm that I cannot reproduce the memory leak when the CALICO_IPV4POOL_BLOCK_SIZE block holds more addresses than the kubelet maxPods option. So I suppose that memory leaks when a Calico node uses more than one IPAM block.

shcherbak commented 2 years ago

@kppullin can you please provide your kubelet maxPods value and the Calico CALICO_IPV4POOL_BLOCK_SIZE?

primeroz commented 2 years ago

Seeing the same issue on a kubeadm 1.20.15 cluster with calico doing both CNI and network policy - Calico version 3.21.4

I have deleted the 2 stale ipamblocks and will check if the memory leak goes away.

Side note: I use the same version of Calico on some EKS clusters only for the NetworkPolicy part (using the AWS VPC CNI) and I don't see the same memory leak there.


EDIT: Removing the 2 stale ipamblocks seems to have fixed the memory leak

ml14tlc commented 2 years ago

Hi all,

I deployed Calico 3.21.5 via the operator Helm chart on a Kubernetes 1.22.6 cluster (one control plane node, 2 workers, all based on Ubuntu 20.04.4), and the memory usage of the calico-kube-controllers pod grows without end (everything else behaves as expected). I'm using VXLAN and eBPF, the MTU is 1350 (i.e., 1400 - 50), and I disabled kube-proxy.

As far as I can see, the IPAM blocks are healthy. The kubelet can schedule up to 110 pods per node while the block size is /26, so a single block doesn't cover the maximum number of pods on a node; however, I'm below that limit.

 calicoctl ipam show --show-blocks
+----------+-------------------+-----------+------------+--------------+
| GROUPING |       CIDR        | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+-------------------+-----------+------------+--------------+
| IP Pool  | 10.244.0.0/16     |     65536 | 26 (0%)    | 65510 (100%) |
| Block    | 10.244.132.192/26 |        64 | 20 (31%)   | 44 (69%)     |
| Block    | 10.244.216.64/26  |        64 | 1 (2%)     | 63 (98%)     |
| Block    | 10.244.35.128/26  |        64 | 5 (8%)     | 59 (92%)     |
+----------+-------------------+-----------+------------+--------------+

I also ran calicoctl ipam check --show-problem-ips and restarted the pods with an IP that didn't belong to any block.

According to the documentation

If a host allocates all of the addresses in a block then it will be allocated an additional block. If there are no more blocks available then the host can take addresses from blocks allocated to other hosts

If I understand correctly, I don't need to resize the block size to /25 or /24 to cover kubelet's maxPods.

The installations.operator.tigera.io:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  annotations:
    meta.helm.sh/release-name: tigera-operator
    meta.helm.sh/release-namespace: tigera-operator
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
spec:
  calicoNetwork:
    bgp: Disabled
    hostPorts: Disabled
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLAN
      natOutgoing: Enabled
      nodeSelector: all()
    linuxDataplane: BPF
    mtu: 1350
    multiInterfaceMode: None
    nodeAddressAutodetectionV4:
      firstFound: true
  cni:
    ipam:
      type: Calico
    type: Calico
  componentResources:
  - componentName: Node
    resourceRequirements:
      requests:
        cpu: 250m
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  variant: Calico

The felixconfigurations.crd.projectcalico.org:

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfKubeProxyIptablesCleanupEnabled: false
  bpfLogLevel: ""
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

In the calico-kube-controllers log, the following messages are repeated every few seconds:

2022-04-20 21:46:51.310 [INFO][1] resources.go 350: Main client watcher loop
2022-04-20 21:46:51.310 [DEBUG][1] watcher.go 94: Kubernetes watcher/converter started resource="KubeControllersConfiguration (custom)"
2022-04-20 21:46:51.312 [DEBUG][1] watcher.go 122: Kubernetes event converted and sent to backend watcher resource="KubeControllersConfiguration (custom)"
2022-04-20 21:46:52.279 [DEBUG][1] k8s.go 628: Performing 'Get' for ClusterInformation(default)
2022-04-20 21:46:52.280 [DEBUG][1] customresource.go 205: Get custom Kubernetes resource Key=ClusterInformation(default) Resource="ClusterInformations" Revision=""
2022-04-20 21:46:52.280 [DEBUG][1] customresource.go 216: Get custom Kubernetes resource by name Key=ClusterInformation(default) Name="default" Namespace="" Resource="ClusterInformations" Revision=""
2022-04-20 21:46:52.352 [DEBUG][1] client.go 294: Cluster GUID value already set ClusterGUID="0b07c578c972416c9009317bfec5d48a"
2022-04-20 21:46:52.352 [DEBUG][1] client.go 304: DatastoreReady value already set DatastoreReady=(*bool)(0xc000da427a)
2022-04-20 21:46:55.284 [DEBUG][1] watcher.go 108: Watcher channel closed by remote resource="IPAMBlock (custom)"
2022-04-20 21:46:55.284 [DEBUG][1] watcher.go 96: Kubernetes watcher/converter stopped, closing result channel resource="IPAMBlock (custom)"
2022-04-20 21:46:55.284 [INFO][1] watchercache.go 97: Watch channel closed by remote - recreate watcher ListRoot="/calico/ipam/v2/assignment/"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 155: Starting watch sync/resync processing ListRoot="/calico/ipam/v2/assignment/"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 259: Stopping previous watcher ListRoot="/calico/ipam/v2/assignment/"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 171: Starting main resync loop ListRoot="/calico/ipam/v2/assignment/"
2022-04-20 21:46:55.284 [DEBUG][1] k8s.go 658: Performing 'Watch' for {IPVersion:0} model.BlockListOptions
2022-04-20 21:46:55.284 [DEBUG][1] watcher.go 108: Watcher channel closed by remote resource="Node"
2022-04-20 21:46:55.284 [DEBUG][1] watcher.go 96: Kubernetes watcher/converter stopped, closing result channel resource="Node"
2022-04-20 21:46:55.284 [INFO][1] watchercache.go 97: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 155: Starting watch sync/resync processing ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 259: Stopping previous watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-20 21:46:55.284 [DEBUG][1] watchercache.go 171: Starting main resync loop ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-20 21:46:55.285 [DEBUG][1] k8s.go 658: Performing 'Watch' for Node model.ResourceListOptions
2022-04-20 21:46:55.285 [DEBUG][1] watcher.go 108: Watcher channel closed by remote resource="KubeControllersConfiguration (custom)"
2022-04-20 21:46:55.285 [DEBUG][1] watcher.go 96: Kubernetes watcher/converter stopped, closing result channel resource="KubeControllersConfiguration (custom)"
2022-04-20 21:46:55.285 [DEBUG][1] resources.go 359: Watcher results channel closed by remote
2022-04-20 21:46:55.285 [INFO][1] resources.go 378: Terminating main client watcher loop
2022-04-20 21:46:55.285 [DEBUG][1] validator.go 377: Validate Felix log level: Debug
2022-04-20 21:46:55.285 [DEBUG][1] validator.go 377: Validate Felix log level: Debug
2022-04-20 21:46:55.285 [DEBUG][1] k8s.go 553: Performing 'Update' for &{Key:KubeControllersConfiguration(default) Value:0xc00000d860 Revision:4720424 UID:<nil> TTL:0s}
2022-04-20 21:46:55.285 [DEBUG][1] customresource.go 125: Update resource by name Key=KubeControllersConfiguration(default) Name="default" Resource="KubeControllersConfigurations" Value=&v3.KubeControllersConfiguration{TypeMeta:v1.TypeMeta{Kind:"KubeControllersConfiguration", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"default", GenerateName:"", Namespace:"", SelfLink:"", UID:"1c80cdec-1dc0-4788-abf1-7fb24f789e70", ResourceVersion:"4720424", Generation:0, CreationTimestamp:time.Date(2022, time.April, 10, 23, 15, 28, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.KubeControllersConfigurationSpec{LogSeverityScreen:"Debug", HealthChecks:"Enabled", EtcdV3CompactionPeriod:(*v1.Duration)(0xc00062ffe0), PrometheusMetricsPort:(*int)(0xc000da4000), Controllers:v3.ControllersConfig{Node:(*v3.NodeControllerConfig)(0xc0005a3530), Policy:(*v3.PolicyControllerConfig)(0xc0008db108), WorkloadEndpoint:(*v3.WorkloadEndpointControllerConfig)(0xc0008db128), ServiceAccount:(*v3.ServiceAccountControllerConfig)(0xc0008db118), Namespace:(*v3.NamespaceControllerConfig)(0xc0008db0e8)}}, Status:v3.KubeControllersConfigurationStatus{RunningConfig:v3.KubeControllersConfigurationSpec{LogSeverityScreen:"Debug", HealthChecks:"Enabled", EtcdV3CompactionPeriod:(*v1.Duration)(0xc000da45b0), PrometheusMetricsPort:(*int)(nil), Controllers:v3.ControllersConfig{Node:(*v3.NodeControllerConfig)(0xc0006b2720), Policy:(*v3.PolicyControllerConfig)(nil), WorkloadEndpoint:(*v3.WorkloadEndpointControllerConfig)(nil), ServiceAccount:(*v3.ServiceAccountControllerConfig)(nil), Namespace:(*v3.NamespaceControllerConfig)(nil)}}, EnvironmentVars:map[string]string{"DATASTORE_TYPE":"kubernetes", "ENABLED_CONTROLLERS":"node"}}}
2022-04-20 21:46:55.312 [DEBUG][1] watchercache.go 251: Resync completed, now watching for change events ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2022-04-20 21:46:55.312 [DEBUG][1] watchercache.go 251: Resync completed, now watching for change events ListRoot="/calico/ipam/v2/assignment/"
2022-04-20 21:46:55.312 [DEBUG][1] watcher.go 94: Kubernetes watcher/converter started resource="IPAMBlock (custom)"
2022-04-20 21:46:55.312 [DEBUG][1] watcher.go 94: Kubernetes watcher/converter started resource="Node"
2022-04-20 21:46:55.346 [DEBUG][1] k8s.go 643: Performing 'List' for KubeControllersConfiguration model.ResourceListOptions
2022-04-20 21:46:55.346 [DEBUG][1] customresource.go 239: Received List request ListInterface=KubeControllersConfiguration Resource="KubeControllersConfigurations" Type="CustomResource"
2022-04-20 21:46:55.346 [DEBUG][1] customresource.go 245: Performing List using Get ListInterface=KubeControllersConfiguration Resource="KubeControllersConfigurations" Type="CustomResource"
2022-04-20 21:46:55.346 [DEBUG][1] customresource.go 205: Get custom Kubernetes resource Key=KubeControllersConfiguration(default) Resource="KubeControllersConfigurations" Revision=""
2022-04-20 21:46:55.346 [DEBUG][1] customresource.go 216: Get custom Kubernetes resource by name Key=KubeControllersConfiguration(default) Name="default" Namespace="" Resource="KubeControllersConfigurations" Revision=""
2022-04-20 21:46:55.352 [DEBUG][1] k8s.go 658: Performing 'Watch' for KubeControllersConfiguration model.ResourceListOptions
2022-04-20 21:46:55.354 [INFO][1] resources.go 350: Main client watcher loop
ml14tlc commented 2 years ago

Update: the above issue is solved. It was due to a misconfiguration of the load balancer in front of the Kubernetes API. Connections were being dropped; the Calico controller probably couldn't finish its main client watcher loop and kept starting new ones, leading to a memory leak.

cloud-66 commented 2 years ago

The same problem: calico-kube-controllers restarts every 25-30 minutes due to the OOM killer (1G memory limit) on a Kubernetes 1.20.9 cluster with Calico v3.20.3 installed by kubespray. How can it be solved by load balancer configuration? As the balancer, the kubespray installation uses a static nginx container running on every worker node.


primeroz commented 2 years ago

The balancer problem might have been the cause for that user, but it is not the real issue; I am still seeing OOMs and the memory leak without the misconfiguration the other user reported.

I am just hoping it goes away on upgrades

cloud-66 commented 2 years ago

I am just hoping it goes away on upgrades

Which version do you use? There is a suggestion to update to v3.22, which should include some memory-usage fixes (#5037).

huangjiasingle commented 2 years ago

v3.22

Calico 3.22.1: the same memory leak issue.

huangjiasingle commented 2 years ago

Update: the above issue is solved. It was due to a misconfiguration of the load balancer in front of the Kubernetes API. Connections were being dropped; the Calico controller probably couldn't finish its main client watcher loop and kept starting new ones, leading to a memory leak.

Which version fixed this issue?

huangjiasingle commented 2 years ago

@caseydavenport Calico 3.22.1. I think the reason is that the IPAMController's podCache field in calico-kube-controllers is what drives the memory increase. Why? The IPAMController's podCache field is a map[string]*v1.Pod. In my prod env we create up to 500000~1000000 data-processing tasks per day (using Argo Workflows), so memory grows over time; within only 5~7 days calico-kube-controllers uses more than 180Gi of memory. When a Go map is used to store a large number of entries, the memory the map has allocated is not garbage collected even after the keys are deleted. So I think the IPAMController's podCache should use a pod informer instead, or use another data structure to store pods in the IPAM controller's pod cache.

caseydavenport commented 2 years ago

When a Go map is used to store a large number of entries, the memory the map has allocated is not garbage collected even after the keys are deleted. So I think the IPAMController's podCache should use a pod informer instead, or use another data structure to store pods in the IPAM controller's pod cache.

It does seem plausible that we're hitting a leak within the implementation of Go maps, namely that we might be left with unused hash buckets as a result of high churn in the entries within the map.

I wonder how the pod informer handles this. Perhaps it periodically copies / recreates the map, which is the only solution I can see off the top of my head.
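
For illustration, here is a minimal, self-contained sketch of the map behavior being discussed (this is not the kube-controllers code): deleting keys from a Go map frees the values, but the map's grown bucket storage stays allocated, so the memory only comes back once the map itself is dropped or recreated.

package main

import (
	"fmt"
	"runtime"
)

// heapMB forces a GC and reports the live heap size in MiB.
func heapMB() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc / (1 << 20)
}

func main() {
	cache := make(map[string][]byte)

	// Simulate churn: many short-lived "pod" entries added to the cache.
	for i := 0; i < 500000; i++ {
		cache[fmt.Sprintf("pod-%d", i)] = make([]byte, 256)
	}
	fmt.Println("after inserts: ", heapMB(), "MiB")

	// Deleting every key frees the values, but the map keeps its grown
	// bucket storage, so the heap does not return to its starting size.
	for k := range cache {
		delete(cache, k)
	}
	fmt.Println("after deletes: ", heapMB(), "MiB")

	// Replacing the map lets the old buckets be collected.
	cache = make(map[string][]byte)
	fmt.Println("after recreate:", heapMB(), "MiB")
	_ = cache
}

On a typical run the heap stays well above its baseline after the deletes and only drops once the map is replaced, which matches the periodic copy/recreate idea above.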

chaewoorha commented 1 year ago

Hello, guys.

I am installing Calico v3.24.3 and seeing the OOM killer invoked by calico-ipam, so I just wanted to ask whether the calico-ipam memory leak has still been seen since v3.20.

root@array114-c2:~# dmesg | grep oom
[11403.505124] calico-ipam invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=999
[11403.545930]  oom_kill_process.cold+0xb/0x10
[11403.666436] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[11403.666453] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task=calico,pid=156702,uid=0
[11403.666523] Memory cgroup out of memory: Killed process 156702 (calico) total-vm:745976kB, anon-rss:17020kB, file-rss:25064kB, shmem-rss:0kB, UID:0 pgtables:204kB oom_score_adj:999
[11407.402385] calico-ipam invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=999
[11407.416249]  oom_kill_process.cold+0xb/0x10
[11407.568225] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[11407.568246] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task=calico,pid=156810,uid=0
[11407.568307] Memory cgroup out of memory: Killed process 156810 (calico) total-vm:745464kB, anon-rss:15376kB, file-rss:25000kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:999
[11411.003996] calico-ipam invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=999
[11411.044750]  oom_kill_process.cold+0xb/0x10
[11411.174588] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[11411.174602] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8e62c259_dbc5_49fa_ab7b_bd20409b71cd.slice/cri-containerd-1c2d7115b98b83b320c45b7c724265af4ee9ba5350c4f765a959e71b7cdba9a0.scope,task=multus-daemon,pid=32382,uid=0
[11411.174665] Memory cgroup out of memory: Killed process 32382 (multus-daemon) total-vm:748996kB, anon-rss:16580kB, file-rss:23012kB, shmem-rss:0kB, UID:0 pgtables:180kB oom_score_adj:999
kppullin commented 1 year ago

I've not seen the OOM for quite some time now. We're running v3.26.1 currently so can't comment on v3.24.3 specifically.

zhucan commented 1 month ago

etcd 3.4.27, kubernetes 1.23.17, debian buster, kernel 5.4.0, calico 3.25.1 the same memory leak issue