gongzixiangyuan opened 11 months ago
@coutinhop Is there any room for optimization in this scenario? Looking forward to your reply, thank you
@gongzixiangyuan Please upgrade to the latest Calico and see if this reproduces; I think we've fixed a memory leak or two. It's not easy to make kube-controllers scan the datastore without loading all of it; the Kubernetes client library makes that quite difficult. However, it's possible that you've discovered a memory-leak bug, for example.
300M does sound like a lot of memory for just 5k pods; it's possible that your datastore has leaked IPAM blocks (and the OOMs are killing kube-controllers, which prevents it from cleaning them up, so the problem gets worse). You may be able to clean some up with calicoctl ipam check/release:
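For reference, the check/release flow looks roughly like this in recent calicoctl versions (a sketch, assuming calicoctl is already configured to reach the cluster's datastore; the report file name is arbitrary):

```shell
# Run the consistency check and write a machine-readable report.
calicoctl ipam check -o ipam-report.json

# After reviewing the report, release the leaked allocations it identified.
# Run this while nothing else is mutating IPAM (e.g. during a quiet period).
calicoctl ipam release --from-report=ipam-report.json
```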
https://docs.tigera.io/calico/latest/reference/calicoctl/ipam/check#examples
@fasaxc thank you for your reply
I checked; there shouldn't be any leaked IPAM handles. Because my environment is still creating/deleting pods frequently, I expect to see one or two transient problems reported.
https://github.com/projectcalico/calico/pull/7433 ( This memory leak issue has also been updated to the environment )
Below is the result of my check:
[root@master1 ]# calicoctl ipam check
Checking IPAM for inconsistencies...
Loading all IPAM blocks... Found 107 IPAM blocks.
IPAM block 172.18.11.192/26 affinity=host:work15: IPAM block 172.18.115.128/26 affinity=host:work13: IPAM block 172.18.115.192/26 affinity=host:work13: IPAM block 172.18.115.64/26 affinity=host:work13: IPAM block 172.18.116.0/26 affinity=host:work13: IPAM block 172.18.116.64/26 affinity=host:work13: IPAM block 172.18.12.0/26 affinity=host:work15: IPAM block 172.18.12.128/26 affinity=host:work15: IPAM block 172.18.12.192/26 affinity=host:work20: IPAM block 172.18.12.64/26 affinity=host:work15: IPAM block 172.18.122.128/26 affinity=host:work12: IPAM block 172.18.122.192/26 affinity=host:work12: IPAM block 172.18.123.0/26 affinity=host:work2: IPAM block 172.18.123.128/26 affinity=host:work12: IPAM block 172.18.123.192/26 affinity=host:work2: IPAM block 172.18.123.64/26 affinity=host:work2: IPAM block 172.18.124.0/26 affinity=host:work12: IPAM block 172.18.124.64/26 affinity=host:work12: IPAM block 172.18.126.0/26 affinity=host:work8: IPAM block 172.18.126.128/26 affinity=host:work8: IPAM block 172.18.126.192/26 affinity=host:work8: IPAM block 172.18.126.64/26 affinity=host:work8: IPAM block 172.18.127.0/26 affinity=host:work8: IPAM block 172.18.13.0/26 affinity=host:work20: IPAM block 172.18.13.128/26 affinity=host:work20: IPAM block 172.18.13.192/26 affinity=host:work20: IPAM block 172.18.13.64/26 affinity=host:work20: IPAM block 172.18.130.128/26 affinity=host:work5: IPAM block 172.18.130.192/26 affinity=host:work5: IPAM block 172.18.131.0/26 affinity=host:work5: IPAM block 172.18.131.128/26 affinity=host:work5: IPAM block 172.18.131.64/26 affinity=host:work5: IPAM block 172.18.132.192/26 affinity=host:work10: IPAM block 172.18.133.0/26 affinity=host:work10: IPAM block 172.18.133.64/26 affinity=host:work10: IPAM block 172.18.136.0/26 affinity=host:master3: IPAM block 172.18.136.128/26 affinity=host:master3: IPAM block 172.18.136.64/26 affinity=host:master3: IPAM block 172.18.137.128/26 affinity=host:master1:
IPAM block 172.18.137.192/26 affinity=host:master1: IPAM block 172.18.137.64/26 affinity=host:master1: IPAM block 172.18.14.0/26 affinity=host:work15: IPAM block 172.18.140.192/26 affinity=host:work18: IPAM block 172.18.141.0/26 affinity=host:work18: IPAM block 172.18.141.128/26 affinity=host:work18: IPAM block 172.18.141.192/26 affinity=host:work18: IPAM block 172.18.141.64/26 affinity=host:work18: IPAM block 172.18.180.0/26 affinity=host:master2: IPAM block 172.18.180.128/26 affinity=host:master2: IPAM block 172.18.180.64/26 affinity=host:master2: IPAM block 172.18.198.128/26 affinity=host:work17: IPAM block 172.18.198.192/26 affinity=host:work17: IPAM block 172.18.198.64/26 affinity=host:work17: IPAM block 172.18.199.0/26 affinity=host:work17: IPAM block 172.18.199.64/26 affinity=host:work17: IPAM block 172.18.204.0/26 affinity=host:work11: IPAM block 172.18.204.128/26 affinity=host:work21: IPAM block 172.18.204.192/26 affinity=host:work11: IPAM block 172.18.204.64/26 affinity=host:work21: IPAM block 172.18.205.0/26 affinity=host:work21: IPAM block 172.18.205.128/26 affinity=host:work11: IPAM block 172.18.205.192/26 affinity=host:work11: IPAM block 172.18.205.64/26 affinity=host:work11: IPAM block 172.18.215.0/26 affinity=host:work1: IPAM block 172.18.215.128/26 affinity=host:work1: IPAM block 172.18.215.64/26 affinity=host:work1: IPAM block 172.18.228.192/26 affinity=host:work22: IPAM block 172.18.229.0/26 affinity=host:work22: IPAM block 172.18.229.128/26 affinity=host:work22: IPAM block 172.18.229.192/26 affinity=host:work22: IPAM block 172.18.229.64/26 affinity=host:work22: IPAM block 172.18.243.128/26 affinity=host:work6: IPAM block 172.18.243.192/26 affinity=host:work6: IPAM block 172.18.244.0/26 affinity=host:work6: IPAM block 172.18.244.128/26 affinity=host:work6: IPAM block 172.18.244.64/26 affinity=host:work6: IPAM block 172.18.252.192/26 affinity=host:work4: IPAM block 172.18.253.0/26 affinity=host:work19: IPAM block 172.18.253.128/26 affinity=host:work19:
IPAM block 172.18.253.192/26 affinity=host:work4: IPAM block 172.18.253.64/26 affinity=host:work4: IPAM block 172.18.254.0/26 affinity=host:work19: IPAM block 172.18.254.128/26 affinity=host:work4: IPAM block 172.18.254.64/26 affinity=host:work4: IPAM block 172.18.27.0/26 affinity=host:work7: IPAM block 172.18.27.128/26 affinity=host:work7: IPAM block 172.18.27.192/26 affinity=host:work7: IPAM block 172.18.27.64/26 affinity=host:work7: IPAM block 172.18.28.0/26 affinity=host:work7: IPAM block 172.18.33.192/26 affinity=host:work3: IPAM block 172.18.34.0/26 affinity=host:work3: IPAM block 172.18.34.64/26 affinity=host:work3: IPAM block 172.18.79.192/26 affinity=host:work14: IPAM block 172.18.80.0/26 affinity=host:work16: IPAM block 172.18.80.128/26 affinity=host:work16: IPAM block 172.18.80.192/26 affinity=host:work14: IPAM block 172.18.80.64/26 affinity=host:work14: IPAM block 172.18.81.0/26 affinity=host:work16: IPAM block 172.18.81.128/26 affinity=host:work16: IPAM block 172.18.81.192/26 affinity=host:work14: IPAM block 172.18.81.64/26 affinity=host:work16: IPAM block 172.18.82.0/26 affinity=host:work14: IPAM block 172.18.87.0/26 affinity=host:work9: IPAM block 172.18.87.128/26 affinity=host:work9: IPAM block 172.18.87.192/26 affinity=host:work9: IPAM block 172.18.87.64/26 affinity=host:work9: IPAM block 172.18.88.0/26 affinity=host:work9:
IPAM blocks record 5886 allocations.
Loading all IPAM pools... 172.18.0.0/16
Found 1 active IP pools.
Loading all nodes. Found 0 node tunnel IPs.
Loading all workload endpoints. Found 5885 workload IPs.
Workloads and nodes are using 5885 IPs.
Looking for top (up to 20) nodes by allocations...
work9 has 297 allocations
work15 has 296 allocations
work22 has 292 allocations
work12 has 292 allocations
work18 has 292 allocations
work8 has 291 allocations
work20 has 291 allocations
work13 has 291 allocations
work17 has 291 allocations
work4 has 290 allocations
work11 has 290 allocations
work16 has 289 allocations
work7 has 288 allocations
work5 has 287 allocations
work14 has 270 allocations
work21 has 164 allocations
work10 has 160 allocations
work19 has 159 allocations
work6 has 158 allocations
master3 has 153 allocations
Node with most allocations has 297; median is 288
Scanning for IPs that are allocated but not actually in use...
Found 1 IPs that are allocated in IPAM but not actually in use.
Scanning for IPs that are in use by a workload or node but not allocated in IPAM...
Found 0 in-use IPs that are not in active IP pools.
Found 0 in-use IPs that are in active IP pools but have no corresponding IPAM allocation.
Check complete; found 1 problems.
@gongzixiangyuan please try the latest version; I think there have been some fixes. We also added a debug server which will let you collect a memory profile so we can see what's going on.
Thank you so much! I will try again when the performance environment is OK.
@fasaxc I collected the profiles. I suspect that IPAM sync is triggered too frequently (more than 40 times in one minute), causing too many Pod objects to be converted to WorkloadEndpoints and keeping the GC very busy.
2023-12-01 20:43:59.617 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:00.015 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:00.060 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:00.570 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:00.823 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.040 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.173 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.376 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.707 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.842 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:01.909 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:02.025 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:02.309 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:02.507 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:02.709 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:02.907 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.019 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.195 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.442 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.532 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.682 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:03.893 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.009 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.299 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.479 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.521 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.813 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:04.923 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:07.808 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:08.208 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:08.408 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:08.608 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:08.808 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:08.910 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:09.011 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:09.209 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:09.407 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:09.511 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:09.710 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:10.224 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:10.337 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:10.608 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:10.788 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:24.796 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:44:43.197 [INFO][11] ipam.go 275: Triggered IPAM sync
2023-12-01 20:45:01.160 [INFO][11] ipam.go 275: Triggered IPAM sync
The relevant call stack from the memory profile:
WorkloadEndpointConverter.PodToWorkloadEndpoints in github.com/projectcalico/calico/libcalico-go/lib/backend/k8s/conversion/workload_endpoint.go
*ipamController.allocationIsValid in github.com/projectcalico/calico/kube-controllers/pkg/controllers/node/ipam.go
*ipamController.checkAllocations in github.com/projectcalico/calico/kube-controllers/pkg/controllers/node/ipam.go
*ipamController.syncIPAM in github.com/projectcalico/calico/kube-controllers/pkg/controllers/node/ipam.go
*ipamController.acceptScheduleRequests in github.com/projectcalico/calico/kube-controllers/pkg/controllers/node/ipam.go
*ipamController.Start in github.com/projectcalico/calico/kube-controllers/pkg/controllers/node/ipam.go
Transform functions: transforming the object before it gets placed into the cache. Client-go allows configuring core informers with transform functions. These functions are called with the object as an argument before the object is placed into the cache. The transformer needs to convert the object to a concrete or metadata type if it wants to read its fields. This is lesser-used functionality compared with metadata-only caching. A couple of usage examples:
Support for transform functions was added in controller-runtime (https://github.com/kubernetes-sigs/controller-runtime/pull/1805) with the goal of allowing users to remove managed fields and annotations. Istio's pilot controller uses this mechanism to configure its client-go informers to remove managed fields before putting objects into the cache. I haven't seen any usage examples where non-metadata fields are modified using this mechanism, but I cannot see a reason why new fields (e.g. a label signaling that a transform was applied) could not be added, or existing fields removed.
@fasaxc Maybe you can take a look at this; it should avoid caching the entire pod.
@fasaxc I see an optimization was made here that caches only the pod fields that are actually used. Could calico-kube-controller make a similar enhancement?
https://github.com/cloudnativelabs/kube-router/pull/999/files
That approach doesn't seem to be a good fit for calico-kube-controller. There is a simpler change:
// Call this before podInformer is started.
podInformer.SetTransform(func(obj interface{}) (interface{}, error) {
	// TODO: keep only the pod attributes that calico-kube-controller needs
	return obj, nil
})
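As a rough, self-contained illustration of what such a transform could do (a sketch only: it uses a simplified stand-in struct instead of the real corev1.Pod, and the names pod, slimPod, and transform are hypothetical), the idea is to replace each cached pod with a slimmed-down copy that keeps only the fields the IPAM controller reads:

```go
package main

import "fmt"

// pod is a simplified stand-in for corev1.Pod; a real transform would be
// registered via podInformer.SetTransform and receive *corev1.Pod.
type pod struct {
	Name          string
	Namespace     string
	NodeName      string
	PodIPs        []string
	Annotations   map[string]string // potentially large; not needed after caching
	ManagedFields []string          // potentially large; not needed after caching
}

// slimPod keeps only the attributes the controller actually reads.
type slimPod struct {
	Name      string
	Namespace string
	NodeName  string
	PodIPs    []string
}

// transform has the same shape as a client-go TransformFunc: it is called
// with each object before it is stored in the informer cache, and whatever
// it returns is what gets cached.
func transform(obj interface{}) (interface{}, error) {
	p, ok := obj.(*pod)
	if !ok {
		return obj, nil // leave other object types untouched
	}
	return &slimPod{
		Name:      p.Name,
		Namespace: p.Namespace,
		NodeName:  p.NodeName,
		PodIPs:    p.PodIPs,
	}, nil
}

func main() {
	full := &pod{
		Name:        "nginx-0",
		Namespace:   "default",
		NodeName:    "work9",
		PodIPs:      []string{"172.18.87.3"},
		Annotations: map[string]string{"big": "blob"},
	}
	slim, err := transform(full)
	if err != nil {
		panic(err)
	}
	// Only the slimmed copy would be kept in the cache; the large
	// annotations and managed fields become garbage-collectable.
	fmt.Printf("%+v\n", slim.(*slimPod))
}
```

The trade-off is that anything reading pods from the informer cache must be updated to expect the slim type, so this only works if all consumers of the cache are known.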
I will make this change and see what effect it has.
k8sconfig, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
	// handle the error
}
// Ask the apiserver for protobuf-encoded responses instead of JSON.
k8sconfig = metadata.ConfigFor(k8sconfig)
In addition, using protobuf instead of JSON saves some memory.
Expected Behavior
calico-kube-controller memory usage should not increase significantly with the number of pods.
Your Environment
Calico version 3.25