kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Pod readiness gate shows two target groups #2695

Closed: tailrecur closed 2 years ago

tailrecur commented 2 years ago

Describe the bug: Pod readiness gate shows two target groups

Steps to reproduce: Don't know

Expected outcome: Pod readiness gate shows only one target group

Environment

Additional Context: I recently enabled Pod readiness gates in multiple EKS clusters to avoid downtime due to the delay in NLB target registration. In most clusters, it is working as expected.

However, in a few clusters, it somehow detects two target groups, one of which is correct while the other does not exist. For example:

NAME            READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ingress-lz9fd   1/1     Running   0          3h58m   10.53.65.242   ip-10-53-65-149.eu-west-1.compute.internal   <none>           1/2
ingress-q4mcp   1/1     Running   0          135m    10.53.57.38    ip-10-53-32-144.eu-west-1.compute.internal   <none>           1/2

I'm not sure if this is due to some corrupted state and I don't know how to clear it.

Here's the relevant output from describing one of the pods:

Readiness Gates:
  Type                                                       Status
  target-health.elbv2.k8s.aws/ingr-a75bc0ce22   <none>
  target-health.elbv2.k8s.aws/nlb-a99b47912d    True
Conditions:
  Type                                                      Status
  target-health.elbv2.k8s.aws/nlb-a99b47912d   True
  Initialized                                               True
  Ready                                                     False
  ContainersReady                                           True
  PodScheduled                                              True
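
The stuck gate in the describe output above can also be picked out mechanically. Below is a minimal Python sketch, assuming the pod has been exported as JSON (for example with `kubectl get pod <name> -o json`) and loaded into a dict. `spec.readinessGates[].conditionType` and `status.conditions[]` are standard Pod fields; the function name `unmet_readiness_gates` and the trimmed sample dict are illustrative only and not part of the controller.

```python
from __future__ import annotations


def unmet_readiness_gates(pod: dict) -> list[str]:
    """Return readiness-gate condition types that lack a condition with status "True"."""
    gates = [g["conditionType"] for g in pod.get("spec", {}).get("readinessGates", [])]
    satisfied = {
        c["type"]
        for c in pod.get("status", {}).get("conditions", [])
        if c.get("status") == "True"
    }
    # A gate is unmet when no matching condition exists or its status is not "True".
    return [g for g in gates if g not in satisfied]


# Sample data trimmed to the gate-related fields shown in the describe output.
pod = {
    "spec": {
        "readinessGates": [
            {"conditionType": "target-health.elbv2.k8s.aws/ingr-a75bc0ce22"},
            {"conditionType": "target-health.elbv2.k8s.aws/nlb-a99b47912d"},
        ]
    },
    "status": {
        "conditions": [
            {"type": "target-health.elbv2.k8s.aws/nlb-a99b47912d", "status": "True"},
            {"type": "Initialized", "status": "True"},
            {"type": "Ready", "status": "False"},
            {"type": "ContainersReady", "status": "True"},
            {"type": "PodScheduled", "status": "True"},
        ]
    },
}

print(unmet_readiness_gates(pod))
```

For this pod, the only unmet gate is the one for `ingr-a75bc0ce22`, which matches the `1/2` shown in the READINESS GATES column.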

I see this error in the load-balancer controller pod logs:

{"level":"error","ts":1655481843.6606803,"logger":"controller-runtime.manager.controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"ingr-a75bc0ce22","namespace":"default","error":"TargetGroupNotFound: Target groups 'arn:aws:elasticloadbalancing:eu-west-1:487596255802:targetgroup/ingr-a75bc0ce22/c12bb179eb4ba84d' not found\n\tstatus code: 400, request id: 16ecfcbb-1687-4d8c-b608-cfe48b4940dd"}

kishorj commented 2 years ago

@tailrecur, the controller injects readiness gates during pod creation and includes references to all matching TargetGroupBindings (TGBs) in the pod's namespace. When a TGB is deleted, the controller marks the corresponding condition as ready. In your case, could you verify whether a matching TGB exists in the pod's namespace that references the nonexistent target group ARN?
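
The check suggested here can be sketched in Python, assuming the namespace's TargetGroupBindings have been exported as JSON (for example with `kubectl get targetgroupbindings -n default -o json`) and that the set of target group ARNs that actually exist has been gathered separately (how is outside this sketch). `spec.targetGroupARN` is the field on the TargetGroupBinding resource; the function name `stale_bindings`, the sample list, and the ARN suffix for the `nlb-a99b47912d` group are hypothetical, since the thread only shows the ARN of the missing group.

```python
from __future__ import annotations


def stale_bindings(bindings: list[dict], existing_arns: set[str]) -> list[str]:
    """Return names of TargetGroupBindings whose target group ARN is not known to exist."""
    return [
        b["metadata"]["name"]
        for b in bindings
        if b["spec"]["targetGroupARN"] not in existing_arns
    ]


# ARN taken from the controller error log in this issue.
MISSING_ARN = (
    "arn:aws:elasticloadbalancing:eu-west-1:487596255802:"
    "targetgroup/ingr-a75bc0ce22/c12bb179eb4ba84d"
)
# Hypothetical ARN: the real suffix for this target group is not shown in the thread.
LIVE_ARN = (
    "arn:aws:elasticloadbalancing:eu-west-1:487596255802:"
    "targetgroup/nlb-a99b47912d/0000000000000000"
)

bindings = [
    {"metadata": {"name": "ingr-a75bc0ce22", "namespace": "default"},
     "spec": {"targetGroupARN": MISSING_ARN}},
    {"metadata": {"name": "nlb-a99b47912d", "namespace": "default"},
     "spec": {"targetGroupARN": LIVE_ARN}},
]
existing_arns = {LIVE_ARN}

print(stale_bindings(bindings, existing_arns))
```

Any binding this reports is a candidate for the leftover TGB described above. Because gates are injected at pod creation, pods created while the stale binding existed keep the extra gate until they are recreated.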

tailrecur commented 2 years ago

@kishorj I wasn't aware of the targetgroupbinding custom resource.

As you said, deleting the additional TGB and rolling out a deployment fixed this issue. Thanks a lot for your help!