Closed: kwarunek closed this issue 2 months ago
/area performance
@kwarunek thanks for sharing this. This is an interesting problem, as the standby replica only has a cache of k8s objects, not of AWS objects (the leader replica keeps such a cache in memory).
I think the timestamp is a good idea for skipping already fully reconciled TGBs.
BTW, would you also share the controller logs? I'd like to understand the operations performed by the controller.
@M00nF1sh I will prepare logs (info level) with redacted names
@kwarunek, hi, would you consider enabling the RGT API via the controller flag --feature-gates=EnableRGTAPI=true? It avoids the ELB API throttling issue and helps reduce reconcile time, especially when there are numerous resources. You can read more about the feature gate flag in our release notes and live docs. Please be mindful that the RGT API does not work on private clusters.
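In case it helps others, a minimal sketch of the relevant portion of the controller Deployment with that flag set (the container name and the other args are illustrative placeholders, not taken from this cluster):

```yaml
# Sketch: enable the RGT API feature gate suggested above.
# Container name and other args are illustrative placeholders.
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=my-cluster           # hypothetical cluster name
            - --feature-gates=EnableRGTAPI=true   # feature gate from the comment above
```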
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@kwarunek We have also made some improvements around this area in v2.7.1. Could you please upgrade to this new version and see if this resolves your problem?
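If useful, a minimal sketch of the relevant portion of the Deployment for that upgrade (the image repository shown is the commonly published public one; verify it and adjust to your own installation method, e.g. Helm):

```yaml
# Sketch: bump the controller image to v2.7.1.
# Verify the image repository and tag against your own install before applying.
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          image: public.ecr.aws/eks/aws-load-balancer-controller:v2.7.1
```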
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
It's a bit better, but it still takes ~10 minutes.
/reopen
@kwarunek: Reopened this issue.
++ We are facing this issue right now. Our AWS API calls also get throttled.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
There are customers affected by this, especially at larger scale.
@xdrus: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@kwarunek: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The fix for this was released in v2.9.1.
Describe the bug
We have deployed the aws-load-balancer-controller with 2 replicas. However, during a leader change (due to rotation or the TTL of the worker node), the other replica takes over leadership. This transition takes a considerable amount of time to complete: it appears that the new leader rebuilds the model and attempts to reconcile all configurations (to be precise, it rereads all TGs/ALBs from the AWS API), even those that do not require it.
In the initial 10-20 minutes after starting or assuming leadership, any changes made to the endpoints are delayed and not promptly reflected in the TG/ALB. This delay persists until the 'start/first_reconcile' process is finished.
This makes it pointless to deploy more than one instance of the controller.
From the metrics (attached screenshots) we deduce that it's probably due to API limits/throttling.
Expected outcome
The model needs to be rebuilt, but reconciliation could rely on a timestamp within the TargetGroupBinding (endpoint last change vs. TGB last reconcile), so that only the changes that have actually occurred are considered (see the sketch below).
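A minimal sketch of what such a marker could look like on a TargetGroupBinding, assuming a hypothetical annotation (neither the annotation key nor the skip logic exists in the controller today; names and values are illustrative):

```yaml
# Sketch of the proposal: record when a TGB was last fully reconciled so that,
# on leader start-up, TGBs whose endpoints have not changed since then can be
# skipped instead of being re-read from the AWS API.
# The annotation key below is hypothetical, not an existing controller API.
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: my-service-tgb                                            # illustrative name
  annotations:
    example.k8s.aws/last-reconcile-time: "2024-01-15T10:42:00Z"   # hypothetical marker
spec:
  serviceRef:
    name: my-service                                              # illustrative backend Service
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:...                # redacted/illustrative ARN
```

The new leader would then compare this marker with the time the backing endpoints last changed and call the AWS API only for TGBs whose endpoints changed after the last reconcile.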
Environment
Additional Context: Number of
ALB controller args: