ibalat opened 6 months ago
Hi, we appreciate you bringing this to our attention. We will work to reproduce this issue and investigate further, as it is unusual for the backoff delay to be that value or take that long. Thank you for your understanding and patience.
/kind bug
Hi @huangm777, do you have any update on this issue?
Hi @kishorj, why is the `targetgroupbinding-max-exponential-backoff-delay` value 16m40s? If I set it to ~1m, will that affect or break anything else?
We set the default value to 1000s (16m40s) to follow the upstream client-go default: https://github.com/kubernetes/client-go/blob/62f959700d559dd8a33c1f692cb34219cfef930f/util/workqueue/default_rate_limiters.go#L52.
The value caps the maximum delay before a failed item is retried. Decreasing it to 1-3m gives you faster retry attempts, but it may overwhelm the workqueue and lead to a potential load increase. It is better to test it in a dev environment and fine-tune the value.
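To make the numbers concrete, here is a minimal sketch of the capped exponential backoff described above, using the defaults from the linked client-go rate limiter (base delay 5ms, cap 1000s). This is an illustration of the math only, not the controller's actual code:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	base := 5 * time.Millisecond // client-go default base delay
	max := 1000 * time.Second    // client-go default cap, i.e. 16m40s

	// Delay doubles with each consecutive failure of the same item,
	// until it hits the cap (reached after ~18 failures).
	for failures := 0; failures <= 20; failures++ {
		delay := base * time.Duration(1<<failures)
		if delay > max {
			delay = max
		}
		fmt.Printf("failure %2d: retry after %v\n", failures, delay)
	}
}
```

With a lowered cap such as `--targetgroupbinding-max-exponential-backoff-delay=60s`, only the ceiling changes; the early retries are just as fast either way.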
If I have no failed items, then even if I set the delay to one minute, no extra work happens, right? And if the workqueue does get overwhelmed, can I keep things working correctly by increasing the pod resources?
We are experiencing exactly the same behavior. I can provide outputs or help troubleshoot if needed. We have multiple deployments that experience this.
@othatbrian What is the workaround you followed to mitigate the issue
I added `--targetgroupbinding-max-exponential-backoff-delay=60s` to the command-line arguments for our aws-load-balancer-controller.
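For anyone looking for where that flag goes, here is a sketch of the relevant part of the controller's Deployment manifest. The container name, cluster name, and other args are assumptions based on a typical install; adjust them for your cluster:

```yaml
# Sketch: aws-load-balancer-controller Deployment (typically in kube-system).
# Only the args list is shown; names here are illustrative.
spec:
  template:
    spec:
      containers:
        - name: aws-load-balancer-controller
          args:
            - --cluster-name=my-cluster   # replace with your cluster name
            - --ingress-class=alb
            - --targetgroupbinding-max-exponential-backoff-delay=60s
```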
Describe the bug
I use the pod readiness gate with ALB IP mode. When a new pod becomes ready (after a deployment or scale-up), the pod readiness gate waits too long (~15-16 min) and the pod status shows "corresponding condition of pod readiness gate target-health.elbv2.k8s.aws/k8s-test-testapp-28c9478fae does not exist." with the ReadinessGatesNotReady reason. I found the `targetgroupbindingMaxExponentialBackoffDelay` setting (default 16m40s) and decreased it to 10s. After this change the problem was solved and all pod readiness gates became ready by the 11th second. But it is not a good solution: I don't know which other statuses this parameter affects; maybe it breaks other things. By the way, why is it 16m40s? Could this error be the reason: "Failed to update endpoint test/test-app: Operation cannot be fulfilled on endpoints "test-app": the object has been modified; please apply your changes to the latest version and try again"?
This status is confusing. Could someone please explain it or suggest a solution?
Steps to reproduce
slow_start.duration_seconds=60s
Expected outcome Pod readiness gates should become ready shortly after the pod itself is ready.
Environment