kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0
3.82k stars · 1.41k forks

Failed to start with big numbers of Pods running #3723

Open bacek opened 1 month ago

bacek commented 1 month ago

Based on my understanding of the code at this line https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/e5c21db1192315be31776007817365c946b723e8/pkg/k8s/pod_info_repo.go#L18

the ALB controller tries to sync the state of Pods at start-up within 2 seconds. With a large number of pods on EKS, the control plane is not able to produce results in 2 seconds. For example:

$ time bash -c "kubectl get pod -A | wc -l"
13319

real    0m9.306s
user    0m8.062s
sys 0m0.704s

It takes almost 10 seconds to list 13k pods.

huangm777 commented 1 month ago

Thank you for bringing this issue to our attention. We are investigating it and will follow up as soon as we have more information.

bacek commented 1 month ago

This is probably also caused by the CPU limits we set. We configured very small limits for the ALB controller (100m).

oliviassss commented 1 month ago

@bacek, would you be able to provide the controller logs from when it failed to start? Also, if it's a CPU limit issue, you can adjust the limits via the Helm chart values: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L62

We haven't tested the controller at such a large scale, and I'm not sure what exactly caused the start-up failure. Scaling up the replicas may also help in some cases, e.g. if there is a load-induced failure from calls to the aws-load-balancer-webhook-service. You can check the relevant setting here: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L19
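For reference, the two Helm values mentioned (resource limits and replica count) can be set together in a values file. The numbers below are illustrative only, not recommendations:

```yaml
# values.yaml overrides for the aws-load-balancer-controller chart
replicaCount: 2

resources:
  limits:
    cpu: 500m       # raise from a too-small limit (e.g. 100m)
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```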

bacek commented 4 weeks ago

@oliviassss The logs were just the standard start-up logs with the error at the end. The issue was resolved when we bumped the CPU resource limit to 500m.