Closed michael-careplanner closed 2 years ago
/assign @M00nF1sh
This is a race condition similar to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764. It occurs because the TGBs (TargetGroupBindings) have not yet been created when the pods are created.
We can address this issue by marking pods with the corresponding readinessGates as healthy when deleting a TGB.
/kind bug
/assign
Describe the bug
Changing an ingress group and doing a deployment rollout at the same time can cause new pods to never be registered on the new target group. This causes the readiness gate on the pod to never pass, stalling the deployment rollout. The only fix is to delete the unready pod, at which point a new one is created that gets added to the new target group as expected.
Steps to reproduce
nginx-chart.tar.gz
Above is a simple Helm chart that can be used to replicate this bug. It runs an nginx deployment with a service and ingress. To replicate, install this chart into a namespace that has the
elbv2.k8s.aws/pod-readiness-gate-inject: enabled
label:
helm install nginx-chart nginx-chart/ --set group=internal --set tag=stable
Then, trigger an update of the chart, changing both the nginx image tag (to trigger a deployment rollout) and the ingress group at the same time:
helm upgrade nginx-chart nginx-chart/ --set group=external --set tag=latest
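Since it can take several toggles to hit the race, the upgrade step can be wrapped in a small helper. This is only a sketch built from the two commands above: the function name `toggle_upgrade` and the overridable `HELM` variable are my own additions, not part of the chart.

```shell
# Sketch: alternate the ingress group and image tag on successive upgrades.
# HELM is an illustrative override hook; set HELM=echo to print the command
# instead of running it.
HELM="${HELM:-helm}"

# toggle_upgrade N: odd N upgrades to external/latest, even N back to internal/stable.
toggle_upgrade() {
  if [ $(( $1 % 2 )) -eq 1 ]; then
    group=external tag=latest
  else
    group=internal tag=stable
  fi
  "$HELM" upgrade nginx-chart nginx-chart/ --set "group=$group" --set "tag=$tag"
}
```

Calling `toggle_upgrade 1`, then `toggle_upgrade 2`, and so on, checking the pods between calls, repeats the reproduction until a pod gets stuck.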
Sometimes it takes a few toggles between the group / tag to trigger the buggy behaviour, but after a few tries you should see that there is a single pod with the readiness gate at 0/1, and if you check the target group, that pod's IP will not have been registered with the load balancer. This unready pod will remain indefinitely and never be registered to the target group.
Expected outcome
The newly deployed pod is registered on the new target group, becomes healthy, and passes the readiness gate, allowing the deployment to complete successfully.
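To spot the stuck pod from the reproduction steps without eyeballing the READINESS GATES column, something like the following could work. It is a rough sketch, not a tested tool: the function names are made up, and it assumes the injected gates use the `target-health.elbv2.k8s.aws/` condition-type prefix and that `kubectl`'s jsonpath output accepts nested `range` blocks as written here.

```shell
# Sketch: print one line per pod as "name<TAB>gate types<TAB>type=status pairs".
# list_pod_gates is a hypothetical helper; pass e.g. "-n my-namespace" as arguments.
list_pod_gates() {
  kubectl get pods "$@" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.readinessGates[*]}{.conditionType}{" "}{end}{"\t"}{range .status.conditions[*]}{.type}{"="}{.status}{" "}{end}{"\n"}{end}'
}

# unready_gates: read those lines on stdin and print the names of pods that have
# a target-health gate with no matching condition whose status is True.
unready_gates() {
  awk -F '\t' '{
    n = split($2, gates, " ")
    for (i = 1; i <= n; i++) {
      if (gates[i] !~ /^target-health\.elbv2\.k8s\.aws\//) continue
      # The gate passes only if "<gate type>=True" appears among the conditions.
      if (index(" " $3 " ", " " gates[i] "=True ") == 0) { print $1; break }
    }
  }'
}
```

Usage would be `list_pod_gates -n <namespace> | unready_gates`. A pod whose gate condition is missing entirely is reported too, which may be the state the stuck pod ends up in here.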
Environment
Additional Context: I think this could be a similar race condition issue to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764
Looking at the controller logs as this is happening it: