Closed: kwarunek closed this issue 2 months ago
/area performance
@kwarunek thanks for sharing this. This is an interesting problem, as the standby replica only has a cache of k8s objects, not of AWS objects (the leader replica keeps such a cache in memory).
I think the timestamp is a good idea for skipping already fully reconciled TGBs.
BTW, would you also share the controller logs? I'd like to understand the operations performed by the controller.
@M00nF1sh I will prepare logs (info level) with redacted names
@kwarunek, hi, would you consider enabling the RGT API via the controller flag --feature-gates=EnableRGTAPI=true? It avoids the ELB API throttling issue and helps reduce reconcile time, especially when there are numerous resources. You can read more about the feature gate flag in our release notes and live docs. Please be mindful that the RGT API does not work on private clusters.
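In case it helps others, a minimal sketch of the relevant portion of the controller Deployment with that flag set (the container name and the other args are illustrative placeholders, not taken from this cluster):

```yaml
# Sketch: enable the RGT API feature gate suggested above.
# Container name and other args are illustrative placeholders.
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=my-cluster           # hypothetical cluster name
            - --feature-gates=EnableRGTAPI=true   # feature gate from the comment above
```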
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@kwarunek We have also made some improvements around this area in v2.7.1. Could you please upgrade to this new version and see if this resolves your problem?
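If useful, a minimal sketch of the relevant portion of the Deployment for that upgrade (the image repository shown is the commonly published public one; verify it and adjust to your own installation method, e.g. Helm):

```yaml
# Sketch: bump the controller image to v2.7.1.
# Verify the image repository and tag against your own install before applying.
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          image: public.ecr.aws/eks/aws-load-balancer-controller:v2.7.1
```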
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
It's a bit better, but it still takes ~10 minutes.
/reopen
@kwarunek: Reopened this issue.
++ We are facing this issue right now. Our AWS API calls also get throttled.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
There are customers affected by this, especially at larger scale.
@xdrus: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@kwarunek: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The fix for this was released in v2.9.1.
Describe the bug
We have deployed the aws-load-balancer-controller with 2 replicas. However, during a leader change (due to rotation or the TTL of the worker node), the other replica takes over leadership. This transition takes a considerable amount of time to complete: it appears that the new leader rebuilds the model and attempts to reconcile all configurations (to be precise, it rereads all TGs/ALBs from the AWS API), even those that do not require it.
In the initial 10-20 minutes after starting or assuming leadership, any changes made to the endpoints are delayed and not promptly reflected in the TG/ALB. This delay persists until the 'start/first_reconcile' process is finished.
This makes it pointless to deploy more than one instance of the controller.
From the metrics (attached screenshots) we deduce that it's probably due to API limits/throttling.
Expected outcome
The model needs to be rebuilt, but reconciliation could rely on a timestamp within the TargetGroupBinding (endpoint last change vs. TGB last reconcile), so that only the changes that have actually occurred are considered (see the sketch below).
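A minimal sketch of what such a marker could look like on a TargetGroupBinding, assuming a hypothetical annotation (neither the annotation key nor the skip logic exists in the controller today; names and values are illustrative):

```yaml
# Sketch of the proposal: record when a TGB was last fully reconciled so that,
# on leader start-up, TGBs whose endpoints have not changed since then can be
# skipped instead of being re-read from the AWS API.
# The annotation key below is hypothetical, not an existing controller API.
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: my-service-tgb                                            # illustrative name
  annotations:
    example.k8s.aws/last-reconcile-time: "2024-01-15T10:42:00Z"   # hypothetical marker
spec:
  serviceRef:
    name: my-service                                              # illustrative backend Service
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:...                # redacted/illustrative ARN
```

The new leader would then compare this marker with the time the backing endpoints last changed and call the AWS API only for TGBs whose endpoints changed after the last reconcile.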
Environment
Additional Context: Number of
ALB controller args: