kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Rate limit issues cause deadlocks, which cause outages. #1048

Closed: tecnobrat closed this issue 4 years ago

tecnobrat commented 5 years ago

Version: 1.1.2

We encountered a deadlock in the ALB ingress controller due to rate limits (with the backoff likely compounding the rate limit problem), which caused an outage.

The deadlock was caused by this:

E1015 14:45:11.579790       1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="error getting web acl for load balancer arn:aws:elasticloadbalancing:us-east-1:062151437226:loadbalancer/app/fffb5690-default-broadcast-bf66/a84fdc267491f064: ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: 9f03f8c3-f087-4c5c-88e3-545fb3ae5c47"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"broadcaster-job-ui"}

This blocked the reconciler for 10 minutes.

During a rolling deploy, new pods are normally added to and removed from the ALB one by one. Because of the deadlock, all of the pods became unhealthy until the controller updated them all at the same time.

I1015 14:45:16.861165       1 targets.go:80] default/accounts-rest: Adding targets to arn:aws:elasticloadbalancing:us-east-1:062151437226:targetgroup/fffb5690-77f48c3689d480d89a0/39213728bb743224: 10.128.0.90:3000, 10.128.15.161:3000, 10.128.9.102:3000
I1015 14:45:17.148664       1 targets.go:95] default/accounts-rest: Removing targets from arn:aws:elasticloadbalancing:us-east-1:062151437226:targetgroup/fffb5690-77f48c3689d480d89a0/39213728bb743224: 10.128.10.211:3000, 10.128.1.169:3000, 10.128.0.191:3000

Normally it adds a single target, then removes a single target, and repeats until all three are done.

tecnobrat commented 5 years ago

I wonder if enabling the new concurrent reconcilers in 1.1.3 will help or make this worse. I could see that consuming MORE of the rate limit.

M00nF1sh commented 5 years ago

Hi, are you actually using the WAF feature? If not, you can disable the WAF API calls with "--feature-gates=waf=false".

M00nF1sh commented 5 years ago

I'll add throttling in the next master release. The v2 branch handles this more gracefully by decoupling infrastructure changes from pod changes (a pod relocation only triggers target group registration, without other API calls).
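
For readers unfamiliar with client-side throttling, here is a minimal token-bucket sketch in Python. It is an illustration only, not the controller's implementation (the controller is written in Go), and the class name, the parameters, and the injectable clock are all hypothetical. The idea is to cap how often the reconciler may call a rate-limited API so that it slows itself down before AWS returns "Rate exceeded".

```python
import time


class TokenBucket:
    """Hypothetical client-side throttle: `rate` calls per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable so the behaviour can be tested
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, capped at `burst`.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` before each API request and requeue the work when it returns False, rather than sending the request and retrying after a throttling error.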

tecnobrat commented 5 years ago

We're going to disable WAF for the time being due to its low rate limit.

Unfortunately, we would prefer to keep WAF set up. We currently don't have any rules on it; it's purely there as a best practice, so that we don't need to add it in a rush if there were an attack or something.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

M00nF1sh commented 4 years ago

@tecnobrat Hi, we just released a version that caches WAF API calls 😄 https://github.com/kubernetes-sigs/aws-alb-ingress-controller/releases/tag/v1.1.5

The final solution is still pending in v2 (which decouples infrastructure changes from target updates).
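
To illustrate how caching can reduce calls to a rate-limited API, here is a small time-to-live cache sketch in Python. It is not the code shipped in v1.1.5; the class name, the `loader` callback, and the TTL value are assumptions made for the example. A repeated lookup for the same load balancer ARN within the TTL is answered from memory instead of triggering another API call.

```python
import time


class TTLCache:
    """Hypothetical cache: remember each lookup result for `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for `key`, calling `loader(key)` only on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # If the loader raises (for example on throttling), nothing is cached.
        value = loader(key)
        self._entries[key] = (now + self.ttl, value)
        return value
```

The trade-off is staleness: a change made outside the controller is not seen until the entry expires, so the TTL bounds how long the cached answer can be wrong.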

tecnobrat commented 4 years ago

Unfortunately this is still an issue for us...

E0128 18:48:24.531490       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to get web acl for load balancer arn:aws:elasticloadbalancing:us-east-1:339498453354:loadbalancer/app/fffb5690-default-magistrat-09cf/2aef934169ff0678: ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: b0b71af4-2dc4-4ed9-bcf4-68b1b35b264b"  "controller"="alb-ingress-controller" "request"={"Namespace":"default","Name":"magistrate-rest"}

Image: docker.io/amazon/aws-alb-ingress-controller:v1.1.5

ab77 commented 4 years ago

Same here; it caused a major outage of our backend systems. We are running docker.io/amazon/aws-alb-ingress-controller:v1.1.6, but apparently in v1.1.7 WAFv2 support can be disabled with the controller flag --feature-gates=wafv2=false, so this is good.

yzargari commented 4 years ago

I've been hitting the rate limit for WAF as well. I also had to add --feature-gates=waf=false to fix the issue...

thanhma commented 3 years ago

I'm getting the "Rate exceeded" error in v2.1.3:

{"level":"error","ts":1615309687.1415963,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"elbv2.k8s.aws","reconcilerKind":"TargetGroupBinding","controller":"targetGroupBinding","name":"k8s-default-alias4bh-d5c9e3xxxx","namespace":"default","error":"Throttling: Rate exceeded\n\tstatus code: 400, request id: 4eef24db-b4b3-46bf-88da-5ed3eb4xxxxx"}

targetgroupbindingMaxConcurrentReconciles is left at its default (None).

After querying for the ThrottlingException as described in https://aws.amazon.com/premiumsupport/knowledge-center/cloudtrail-rate-exceeded/, I found that the exception is thrown on DescribeTargetHealth, DescribeTargetGroups, DescribeTags, RegisterTargets... events.
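
A tally like this can be sketched in Python, assuming the CloudTrail records have already been exported and parsed into dictionaries. `eventName` and `errorCode` are standard CloudTrail record fields; the function name and the sample records are hypothetical. Matching on the substring "Throttling" covers both the "ThrottlingException" and "Throttling" error codes seen in the logs in this thread.

```python
from collections import Counter


def count_throttled_calls(records):
    """Count throttled CloudTrail records per API call (eventName)."""
    counts = Counter()
    for record in records:
        # errorCode is absent on successful calls, so fall back to "".
        error_code = record.get("errorCode") or ""
        if "Throttling" in error_code:
            counts[record.get("eventName", "unknown")] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    {"eventName": "DescribeTargetHealth", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeTargetHealth", "errorCode": "Throttling"},
    {"eventName": "RegisterTargets", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeTags"},
]

for name, n in count_throttled_calls(sample).most_common():
    print(name, n)
```

Sorting by count shows which API calls are throttled most often, which is useful detail to include when reporting the problem.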

Not sure how to fix it.

kishorj commented 3 years ago

@thanhma, please create a new issue with the details: the number of ingresses/services, the number of targets, whether you use instance or IP targets, and whether it is an NLB or ALB.

thanhma commented 3 years ago

Thanks @kishorj, opened new issue #1871