We updated Calico in one of our larger clusters from 3.24.5 to 3.26.1, and since then kube-controllers has comparatively high CPU usage and appears to perform many IPAM syncs.
In a cluster of about 200 nodes and 10,000 pods, 3.24.5 uses about 200m CPU while 3.26.1 uses a full CPU.
The underlying cause appears to be a change, probably added in 3.24.6, which made the allocationIsValid function actually use the cache, reducing the cache sync time from a few minutes to a couple of seconds.
So essentially a problem was fixed, but perhaps the high CPU usage is itself a fixable issue, or the sync frequency is too high.
A pprof profile showed that much of the time is spent in defaultWorkloadEndpointConverter within checkAllocations.
Current Behavior
The difference from 3.24.5 seems to be that it now actually uses the pod cache, which was changed in 3.24.6 (https://github.com/projectcalico/calico/pull/7503).
3.24.5 instead logs the queries for each of the roughly 9,000 IPAM handles, which takes several minutes:
So the cache is now effective, which reduces the load on the API server.
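To illustrate the trade-off described above, here is a minimal sketch of the cache-first lookup pattern: the validity check consults an in-memory pod cache and only falls back to querying the API server on a miss. All names (podCache, allocationIsValid, queryAPIServer) are illustrative, not Calico's actual API; the pre-3.24.6 behavior corresponds to always taking the fallback path.

```go
package main

import "fmt"

// podCache is a stand-in for an in-memory informer cache: pod name -> exists.
type podCache map[string]bool

// apiQueries counts fallback queries to the (mock) API server.
var apiQueries int

// queryAPIServer stands in for a live Kubernetes API call; in this mock,
// every named pod is reported as existing.
func queryAPIServer(pod string) bool {
	apiQueries++
	return pod != ""
}

// allocationIsValid checks the cache first and only queries the API
// server on a cache miss.
func allocationIsValid(cache podCache, pod string) bool {
	if exists, ok := cache[pod]; ok {
		return exists
	}
	return queryAPIServer(pod)
}

func main() {
	cache := podCache{"pod-a": true, "pod-b": true}
	// With a warm cache, repeated sync iterations never touch the API
	// server: the sync gets much faster, but the per-sync CPU cost of
	// checking every allocation is now paid entirely by the controller.
	for i := 0; i < 1000; i++ {
		allocationIsValid(cache, "pod-a")
		allocationIsValid(cache, "pod-b")
	}
	fmt.Println("API queries with warm cache:", apiQueries)
}
```

This is why the sync dropped from minutes to seconds while controller CPU usage went up: the latency of thousands of API round-trips is gone, so the conversion work inside each sync dominates the profile instead.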