kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

New deployment pods never being added to new target group #2393

Closed: michael-careplanner closed this issue 2 years ago

michael-careplanner commented 2 years ago

Describe the bug

Changing an ingress group and doing a deployment rollout at the same time can cause the new pods to never be registered on the new target group. This causes the readiness gate on each pod to never pass, stalling the deployment rollout. The only fix is to delete the unready pod, at which point a replacement is created that gets added to the new target group as expected.
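For reference, the manual workaround is just deleting the stuck pod (the pod name below is a placeholder):

kubectl delete pod <stuck-pod>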

Steps to reproduce

nginx-chart.tar.gz (attached)

Above is a simple Helm chart that can be used to replicate this bug. It runs an nginx deployment with a service and ingress. To replicate, install this chart into a namespace with the elbv2.k8s.aws/pod-readiness-gate-inject: enabled label:

helm install nginx-chart nginx-chart/ --set group=internal --set tag=stable
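If the target namespace does not already carry that label, it can be created and labelled first (a minimal sketch; the namespace name my-ns is illustrative):

kubectl create namespace my-ns
kubectl label namespace my-ns elbv2.k8s.aws/pod-readiness-gate-inject=enabled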

Then, trigger an update of the chart, changing both the nginx image tag (to trigger a deployment rollout) and the ingress group at the same time:

helm upgrade nginx-chart nginx-chart/ --set group=external --set tag=latest
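The stalled rollout is easiest to see by watching its status (the deployment name nginx-chart is assumed from the chart's release name; adjust to match the chart):

kubectl rollout status deployment/nginx-chart -n my-ns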

Sometimes it takes a few toggles between the group/tag values to trigger the buggy behaviour, but after a few tries you should see a single pod with its readiness gate at 0/1; if you check the target group, that pod's IP will not have been registered with the load balancer. The unready pod remains indefinitely and is never registered to the target group.
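The stuck pod's readiness gate and conditions can be inspected directly (the pod name is a placeholder); kubectl get pods -o wide includes a READINESS GATES column:

kubectl get pods -n my-ns -o wide
kubectl describe pod <stuck-pod> -n my-ns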

Expected outcome

The newly deployed pod is registered on the new target group, becomes healthy, and passes the readiness gate, allowing the deployment to complete successfully.

Environment

Additional Context: I think this could be a similar race condition issue to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764

Looking at the controller logs as this is happening:

kishorj commented 2 years ago

/assign @M00nF1sh

M00nF1sh commented 2 years ago

This is a race condition similar to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1764. It happens because the TGBs (TargetGroupBindings) have not yet been created when the pods are created.
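One way to observe this ordering after the upgrade (the namespace name my-ns is illustrative) is to compare the ages of the new pods against the TargetGroupBindings; the TGB for the new group only appears once the controller reconciles the changed ingress:

kubectl get pods -n my-ns
kubectl get targetgroupbindings -n my-ns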

We can address this issue by marking pods with the corresponding readinessGates as healthy when deleting a TGB.

M00nF1sh commented 2 years ago

/kind bug

oliviassss commented 2 years ago

/assign