Throttling: Rate exceeded error when reconcile ALB

kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers

https://kubernetes-sigs.github.io/aws-load-balancer-controller/

Apache License 2.0

Throttling: Rate exceeded error when reconcile ALB #1871

Closed thanhma closed 1 year ago

thanhma commented 3 years ago

I'm getting the Rate exceeded error in v2.1.3

{"level":"error","ts":1615309687.1415963,"logger":"controller","msg":"Reconciler error","reconcilerGroup":"elbv2.k8s.aws","reconcilerKind":"TargetGroupBinding","controller":"targetGroupBinding","name":"k8s-default-alias4bh-d5c9e3xxxx","namespace":"default","error":"Throttling: Rate exceeded\n\tstatus code: 400, request id: 4eef24db-b4b3-46bf-88da-5ed3eb4xxxxx"}

targetgroupbindingMaxConcurrentReconciles is left at None (the default).

After querying for ThrottlingException as described in https://aws.amazon.com/premiumsupport/knowledge-center/cloudtrail-rate-exceeded/, I found that the exception is thrown on DescribeTargetHealth, DescribeTargetGroups, DescribeTags, RegisterTargets and similar events.

The number of ingresses is large: about 10 ingresses per cluster and 200 ingresses in a single VPC. The target type is instance, each cluster has about 60 instances, and the number of targets per ingress can reach 5,000 because there are many rules as well.
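To see which operations are throttled most often, the CloudTrail records can be tallied by operation name. The following is a minimal sketch, assuming the records have already been exported and loaded as dictionaries. `eventName` and `errorCode` are standard CloudTrail record fields; the function name and the sample records are hypothetical.

```python
from collections import Counter


def count_throttled(records):
    """Count throttled calls per API operation in CloudTrail records."""
    counts = Counter()
    for record in records:
        # errorCode is absent on successful calls, so default to "".
        if "Throttling" in record.get("errorCode", ""):
            counts[record.get("eventName", "<unknown>")] += 1
    return counts


# Hypothetical sample records, trimmed to the fields used above.
sample = [
    {"eventName": "DescribeTargetHealth", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeTargetHealth", "errorCode": "ThrottlingException"},
    {"eventName": "RegisterTargets", "errorCode": "ThrottlingException"},
    {"eventName": "DescribeTags"},  # successful call, not counted
]

for name, n in count_throttled(sample).most_common():
    print(name, n)
```

The substring check on `errorCode` is deliberately loose so that it catches both `Throttling` and `ThrottlingException`.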

M00nF1sh commented 3 years ago

@thanhma By default, the controller doesn't set any throttling for the AWS API calls, and we provide a flag that lets customers set it. Let me find the recommended settings and reply back here.

thanhma commented 3 years ago

@M00nF1sh In the CloudTrail log, I can see many DescribeTags, DescribeTargetHealth and DescribeTargetGroups calls getting ThrottlingException. Do these failures affect reconciliation with the instance target type? Could they cause target registration to fail, or degrade performance?

Is there a rate limit configuration?

thanhma commented 3 years ago

@M00nF1sh My cluster stopped reconciling the ALB because RegisterTargets is failing with "Throttling: Rate exceeded". Is there a way to optimize the API calls to prevent throttling?

M00nF1sh commented 3 years ago

@thanhma

Yes, you can set it via a controller flag: ~~--aws-api-throttle=Elastic Load Balancing v2:^RegisterTarget=20:3,Elastic Load Balancing v2:^Describe=40:10~~

--aws-api-throttle=Elastic Load Balancing v2:RegisterTargets|DeregisterTargets=4:20,Elastic Load Balancing v2:.*=10:40 (I made a mistake in the crossed-out config above: the ELB team internally uses burst:rate when describing throttling config, while our controller uses rate:burst.)
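To make the rate:burst ordering concrete, here is a minimal token-bucket sketch. It assumes the limiter behaves like a standard token bucket, where rate is the number of tokens refilled per second and burst is the bucket capacity. It illustrates the idea only and is not the controller's actual code; the class and its methods are hypothetical.

```python
class TokenBucket:
    """Illustrative token bucket: `rate` tokens/second, capacity `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now):
        """Return True if a call at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# With 4:20 read as rate:burst, 20 calls pass immediately,
# then only 4 more pass one second later.
bucket = TokenBucket(rate=4, burst=20)
first = sum(bucket.allow(0.0) for _ in range(100))
later = sum(bucket.allow(1.0) for _ in range(100))
print(first, later)
```

Reading the same numbers as burst:rate would instead mean a capacity of 4 and a refill of 20 per second, which is why the ordering matters.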

The above is the default throttle for normal AWS accounts. If you have multiple controllers or clusters, they share the same throttle per account and region, so you'll need to lower the throttle settings above.
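As a rough sketch of how such a value could be read and divided among several controllers, the snippet below parses the flag string into rules and splits each rule evenly. The `service:operationRegex=rate:burst` layout is inferred from the example above. The helper names, the unanchored regex search, and the even split are all assumptions rather than the controller's documented behavior.

```python
import re
from typing import NamedTuple


class ThrottleRule(NamedTuple):
    service: str
    operation_pattern: str
    rate: float
    burst: int


def parse_throttle(value):
    """Parse 'service:regex=rate:burst,...' into ThrottleRule entries.

    Assumes the regex itself contains no ',' or '=' characters.
    """
    rules = []
    for entry in value.split(","):
        target, _, limits = entry.rpartition("=")
        service, _, pattern = target.partition(":")
        rate, _, burst = limits.partition(":")
        rules.append(ThrottleRule(service, pattern, float(rate), int(burst)))
    return rules


def matching_rules(rules, service, operation):
    """Return the rules whose service and operation regex match the call."""
    return [r for r in rules
            if r.service == service and re.search(r.operation_pattern, operation)]


def split_across(rules, controllers):
    """Naive even split of each rule among `controllers` controllers."""
    return [r._replace(rate=r.rate / controllers,
                       burst=max(1, r.burst // controllers))
            for r in rules]


flag = ("Elastic Load Balancing v2:RegisterTargets|DeregisterTargets=4:20,"
        "Elastic Load Balancing v2:.*=10:40")
rules = parse_throttle(flag)
for rule in split_across(rules, 4):
    print(f"{rule.service}:{rule.operation_pattern}={rule.rate:g}:{rule.burst}")
```

An even split is only a starting point: controllers managing more targets may need a larger share, and the total should stay below the account-level limit.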

ajaykumarmandapati commented 3 years ago

Hi @M00nF1sh, we have a similar issue and face throttling exceptions as well. Could you please explain exactly how to configure the --aws-api-throttle parameter? We also found this documentation here but could not relate to it, since we have already disabled waf, waf2 and shield. Thanks in advance!

thanhma commented 3 years ago

@ajaykumarmandapati I found setting the api-throttle quite difficult when we don't know at what point the API throttle is reached (how many calls per second are acceptable).

However, in v2.2.0 you can use alb.ingress.kubernetes.io/target-node-labels to limit the targets in a single TargetGroup, which lowers the chance of API calls hitting the throttle.

lmbaschiera commented 1 year ago

I'm having this same issue with v2.4.4 after upgrading from v1.2.0-alpha.1. Hundreds of DescribeTargetHealth, DescribeTargetGroups, DescribeTags and RegisterTargets events are making us hit the API rate limits, impacting all clusters, even the ones not upgraded to the new controller version.

How can I fix this issue? I've asked for an API rate limit increase, but that's just a band-aid.

oliviassss commented 1 year ago

@thanhma RGT API support shipped with the v2.5.2 release. You can enable it with the EnableRGTAPI feature gate flag to avoid the throttling issue. Please refer to the release notes for more details.