projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0
5.89k stars 1.31k forks source link

calico-kube-controller 3.26 high cpu usage frequent ipam syncs #7841

Open juliantaylor opened 1 year ago

juliantaylor commented 1 year ago

We updated calico in one of our larger clusters from 3.24.5 to 3.26.1 and since then the kube-controller has a comparably high cpu usage and seems to perform many ipam syncs. 3.24 in an cluster of about 200 nodes and 10000 pods uses about 200m cpu while 3.26.1 uses a full cpu.

The underlying cause appears to be a change probably added in 3.24.6 which made the allocationIsValid function actually use the cache which reduced the cache sync time from a few minutes to a couple seconds. So essentially a problem was fixed but maybe the high cpu usage is a fixable issue or the sync frequency is too high.

a pprofile showed it spends lots of time in defaultWorkloadEndpointConverter in checkAllocations

         0     0%     0%    159.53s 67.54%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).acceptScheduleRequests
     0.64s  0.27%  0.27%    153.88s 65.15%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).checkAllocations
         0     0%  0.27%    153.88s 65.15%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).syncIPAM
     1.44s  0.61%  0.88%    129.67s 54.90%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).allocationIsValid
         0     0%  0.88%     99.52s 42.13%  github.com/projectcalico/calico/libcalico-go/lib/backend/k8s/conversion.defaultWorkloadEndpointConverter.PodToWorkloadEndpoints
     8.49s  3.59%  4.48%     99.52s 42.13%  github.com/projectcalico/calico/libcalico-go/lib/backend/k8s/conversion.defaultWorkloadEndpointConverter.podToDefaultWorkloadEndpoint

Current Behavior

grep -E "Synchronizing IPAM|sync com|Triggering sync after batch of updates "
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:54.246 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:57.021 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:57.026 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:08.677 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=1
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:08.677 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.352 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.949 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.794 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.709 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.466 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.954 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 968: Synchronizing IPAM data

The difference to 3.24 seems to be that it uses the podcache which was changed in 3.24.6 https://github.com/projectcalico/calico/pull/7503

3.24.5 instead logs the queries for each of the 9000 ipamhandles which takes several minutes:

calico-kube-controllers-bc987cddc-vr4gk calico-kube-controllers 2023-07-05 16:20:28.566 [DEBUG][1] ipam.go 1008: Synchronizing IPAM data
calico-kube-controllers-bc987cddc-vr4gk calico-kube-controllers 2023-07-05 16:20:28.572 [DEBUG][1] ipam.go 920: Querying Kubernetes API for pod handle="k8s-pod-network.9...
.

So the cache is now effective which reduces the load on the apiserver.

caseydavenport commented 1 year ago

Sounds like we pushed one bottleneck somewhere else :sweat_smile:

perhaps we need some better internal ratelimiting / batching to make sure we're not eating up too much resource at once.