kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

New deployment pods never being added to new target group #2393

Closed: michael-careplanner closed this issue 2 years ago

michael-careplanner commented 2 years ago

Describe the bug

Changing an ingress group and doing a deployment rollout at the same time can cause the new pods to never be registered on the new target group. This causes the readiness gate on each pod to never pass, stalling the deployment rollout. The only fix is to delete the unready pod, at which point a replacement is created that gets added to the new target group as expected.
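For reference, the manual workaround is just deleting the stuck pod (the pod name below is a placeholder):

kubectl delete pod <stuck-pod>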

Steps to reproduce

nginx-chart.tar.gz (attached)

Above is a simple Helm chart that can be used to replicate this bug. It runs an nginx deployment with a service and ingress. To replicate, install this chart into a namespace with the elbv2.k8s.aws/pod-readiness-gate-inject: enabled label:

helm install nginx-chart nginx-chart/ --set group=internal --set tag=stable
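If the target namespace does not already carry that label, it can be created and labelled first (a minimal sketch; the namespace name my-ns is illustrative):

kubectl create namespace my-ns
kubectl label namespace my-ns elbv2.k8s.aws/pod-readiness-gate-inject=enabled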

Then, trigger an update of the chart, changing both the nginx image tag (to trigger a deployment rollout) and the ingress group at the same time:

helm upgrade nginx-chart nginx-chart/ --set group=external --set tag=latest
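The stalled rollout is easiest to see by watching its status (the deployment name nginx-chart is assumed from the chart's release name; adjust to match the chart):

kubectl rollout status deployment/nginx-chart -n my-ns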

Sometimes it takes a few toggles between the group/tag values to trigger the buggy behaviour, but after a few tries you should see a single pod with its readiness gate at 0/1; if you check the target group, that pod's IP will not have been registered with the load balancer. The unready pod remains indefinitely and is never registered to the target group.
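The stuck pod's readiness gate and conditions can be inspected directly (the pod name is a placeholder); kubectl get pods -o wide includes a READINESS GATES column:

kubectl get pods -n my-ns -o wide
kubectl describe pod <stuck-pod> -n my-ns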

Expected outcome

The newly deployed pod is registered on the new target group, becomes healthy, and passes the readiness gate, allowing the deployment to complete successfully.

Environment

Additional Context: I think this could be a similar race condition issue to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764

Looking at the controller logs as this is happening:

kishorj commented 2 years ago

/assign @M00nF1sh

M00nF1sh commented 2 years ago

This is a race condition similar to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764. It happens because the TGBs (TargetGroupBindings) have not yet been created when the pods are created.
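One way to observe this ordering after the upgrade (the namespace name my-ns is illustrative) is to compare the ages of the new pods against the TargetGroupBindings; the TGB for the new group only appears once the controller reconciles the changed ingress:

kubectl get pods -n my-ns
kubectl get targetgroupbindings -n my-ns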

We can address this issue by marking pods with the corresponding readinessGates as healthy when deleting a TGB.

M00nF1sh commented 2 years ago

/kind bug

oliviassss commented 2 years ago

/assign