bacek opened 1 month ago
Thank you for bringing this issue to our attention. We are investigating it and will follow up as soon as we have more information.
That's probably caused by CPU limits being set too low. We configured very small limits for the ALB controller (100m).
@bacek, would you be able to provide the controller logs from when it failed to start? Also, if it's a CPU limit issue, you can adjust the limits via the Helm chart values: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L62
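For reference, a values override along these lines can raise the limit (a sketch against the values.yaml linked above; the memory values here are assumptions, adjust to your cluster):

```yaml
# values-override.yaml, applied with:
#   helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
#     -n kube-system -f values-override.yaml
resources:
  limits:
    cpu: 500m      # raised from the 100m mentioned in this issue
    memory: 200Mi  # assumed value
  requests:
    cpu: 100m
    memory: 200Mi  # assumed value
```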
We didn't have such a large-scale setup to test the load on the controller, and I'm not sure what exactly caused the controller's start-up to fail. But scaling up the replicas may help in some cases, for example if calls to the aws-load-balancer-webhook-service fail under load. You can check some details here: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/helm/aws-load-balancer-controller/values.yaml#L19
@oliviassss The logs were just the standard start-up logs with the error at the end. The issue was resolved when we bumped the CPU resource limit to 500m.
Based on my understanding of the code at this line https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/e5c21db1192315be31776007817365c946b723e8/pkg/k8s/pod_info_repo.go#L18
the ALB controller tries to sync the state of all Pods within 2 seconds of start-up. With a large number of pods on EKS, the control plane cannot produce results within 2 seconds. For example, it takes almost 10 seconds to list 13k pods.