AWS application load balancer not registering targets for Kubernetes EKS node target group

Describe the bug

I have an EKS cluster with public/private access on a VPC with public and private subnets. I've set up my ALB in the public subnets on port 80, internet-facing, with target type ip, and installed the AWS Load Balancer Controller following the AWS docs and the 2048 deployment example. I am using GPU nodes and have also set up the Kubernetes GPU operator. I have a deployment and service for a Flask REST API.

After getting everything set up, I expected the EKS cluster node instances I have running to register into my target group, but it's empty and the pods have no instances to join.

Here are screenshots of the ALB and the empty target group from the AWS console:

[Image: loadbalancer]

[Image: Screenshot 2024-05-09 172514]

I'm struggling to find out why this is happening. I've been adjusting my ingress and deployment YAML files and thought it might be a selector/label issue, but that doesn't seem to be the case. My deployment runs a Flask API on port 5000, and the ingress routes the /health path to the Flask server's /health endpoint, which returns a response.

Deployment.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: flask-api-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api-deployment
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: flask-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: flask-app
    spec:
      containers:
      - name: flask-app
        image: xxxxxxxxxxxxxxxxxxxxxxx
        imagePullPolicy: Always
        ports:
        - containerPort: 5000
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      restartPolicy: Always
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: efs-claim  
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api-app-service
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: flask-app
  ports:
    - name: http
      port: 80
      targetPort: 5000
      protocol: TCP

ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: flask-api-app
  name: flask-ingress-3
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/is-default-class: "true"
  labels:
    app.kubernetes.io/name: flask-app
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
        - path: /health 
          pathType: Prefix
          backend:
            service:
              name: flask-api-app-service
              port:
                number: 80
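
Because the only rule is a Prefix match on /health, requests to other paths (including /) would not be routed to the service by this Ingress. The following is a minimal Python sketch of the element-wise Prefix matching described in the Kubernetes Ingress documentation. The function name `prefix_matches` is illustrative, and the ALB's own rule evaluation may differ in edge cases.

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match, as described for pathType: Prefix."""
    # Split on "/" and drop empty elements so "/health/" and "/health" compare equal.
    rule_parts = [p for p in rule_path.split("/") if p]
    request_parts = [p for p in request_path.split("/") if p]
    # Every element of the rule must equal the corresponding request element.
    return request_parts[: len(rule_parts)] == rule_parts


if __name__ == "__main__":
    for path in ["/health", "/health/", "/health/live", "/healthz", "/"]:
        print(path, prefix_matches("/health", path))
```

Matching is done per path element, not per character, which is why /healthz is not covered by a /health prefix rule.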

service-account.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: aws-load-balancer-controller
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

This is the Dockerfile I used to build the image for the deployment:

# start by pulling the python image
FROM python:3.9

# copy the requirements file into the image
COPY ./requirements.txt /app/requirements.txt

# switch working directory
WORKDIR /app

# install the dependencies and packages in the requirements file
RUN pip install -r requirements.txt

# copy every content from the local file to the image
COPY . /app

# Expose port 5000 for Gunicorn
EXPOSE 5000

# Configure the container to run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "main:app"]
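
To check the container independently of the ALB, the /health endpoint can be requested directly, for example against a local run of the image or through a port-forward to the pod once it is running. Below is a minimal sketch using only the Python standard library. The function name `check_health` and the `http://localhost:5000` base URL are assumptions for illustration, not part of the setup above.

```python
import urllib.error
import urllib.request


def check_health(base_url: str, path: str = "/health", timeout: float = 5.0) -> int:
    """Return the HTTP status code for GET base_url + path, or 0 if unreachable."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


if __name__ == "__main__":
    # Assumes the API is reachable on localhost:5000 (hypothetical address).
    print(check_health("http://localhost:5000"))
```

A return value of 0 means nothing answered on that address at all, which points at the pod or the network path rather than at the Flask route.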

I also ran the command kubectl describe targetgroupbindings -n flask-api-app and this was the result:

Name:         k8s-flaskapi-flaskapi-c99c751836
Namespace:    flask-api-app
Labels:       ingress.k8s.aws/stack-name=flask-ingress-3
              ingress.k8s.aws/stack-namespace=flask-api-app
Annotations:  <none>
API Version:  elbv2.k8s.aws/v1beta1
Kind:         TargetGroupBinding
Metadata:
  Creation Timestamp:  xxxxxxxxxxxxxxxx
  Finalizers:
    elbv2.k8s.aws/resources
  Generation:        1
  Resource Version:  1802318
  UID:               xxxxxxxxxxxxxxxxxxxxxxxxx
Spec:
  Ip Address Type:  ipv4
  Networking:
    Ingress:
      From:
        Security Group:
          Group ID:  xxxxxxxxxxxxxxxxxxxx
      Ports:
        Port:      5000
        Protocol:  TCP
  Service Ref:
    Name:            flask-api-app-service
    Port:            80
  Target Group ARN:  xxxxxxxxxxxxxxxxxxxxxxxx
  Target Type:       ip
Status:
  Observed Generation:  1
Events:
  Type    Reason                  Age                From                Message
  ----    ------                  ----               ----                -------
  Normal  SuccessfullyReconciled  10m (x2 over 10m)  targetGroupBinding  Successfully reconciled
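
For context on the binding above: with Target Type ip, the registered targets are pod IPs on the container port (5000 here, via Service port 80), not node instances, so a pod without an IP cannot be registered. The following is a simplified Python model of that resolution. The `Pod` class and `resolve_ip_targets` are illustrative names, the second pod and its IP are made up, and this is not the controller's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pod:
    name: str
    ip: Optional[str]  # None while the pod is Pending (no IP assigned yet)
    ready: bool


def resolve_ip_targets(pods: List[Pod], target_port: int) -> List[Tuple[str, int]]:
    """Return (ip, port) pairs that could be registered for target-type ip."""
    # Simplification: only pods that have an IP and report ready are eligible.
    return [(p.ip, target_port) for p in pods if p.ip is not None and p.ready]


if __name__ == "__main__":
    # Service port 80 forwards to targetPort 5000, so targets use port 5000.
    pending = [Pod("flask-api-deployment-59c668dcf8-wzl6p", None, False)]
    print(resolve_ip_targets(pending, 5000))  # empty: nothing to register
    running = [Pod("example-pod", "10.0.1.23", True)]  # hypothetical pod and IP
    print(resolve_ip_targets(running, 5000))
```

Under this model, a "Successfully reconciled" event and an empty target group are not contradictory: the binding can reconcile cleanly while there are simply no eligible pod IPs to register.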

namespaces:

 kubectl get namespaces
NAME              STATUS   AGE
default           Active   8d
flask-api-app     Active   40m
gpu-operator      Active   7d7h
kube-node-lease   Active   8d
kube-public       Active   8d
kube-system       Active   8d

 kubectl get all -n kube-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/aws-load-balancer-controller-6bf4b948d6-c2h9s   1/1     Running   0          40m
pod/aws-load-balancer-controller-6bf4b948d6-h4sqp   1/1     Running   0          40m
pod/aws-node-25wtp                                  2/2     Running   0          51m
pod/aws-node-mfgjn                                  2/2     Running   0          51m
pod/coredns-6c857f58b4-hhq74                        1/1     Running   0          50m
pod/coredns-6c857f58b4-mn2k2                        1/1     Running   0          65m
pod/efs-csi-controller-bb6f8464b-tjd4j              3/3     Running   0          65m
pod/efs-csi-controller-bb6f8464b-zzrjl              3/3     Running   0          65m
pod/efs-csi-node-6rj6n                              3/3     Running   0          51m
pod/efs-csi-node-kdfmh                              3/3     Running   0          51m
pod/eks-pod-identity-agent-pbk84                    1/1     Running   0          51m
pod/eks-pod-identity-agent-qnh8b                    1/1     Running   0          51m
pod/kube-proxy-d59bz                                1/1     Running   0          51m
pod/kube-proxy-n4vjr                                1/1     Running   0          51m

NAME                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
service/aws-load-balancer-webhook-service   ClusterIP   172.20.157.47   <none>        443/TCP                  40m
service/kube-dns                            ClusterIP   172.20.0.10     <none>        53/UDP,53/TCP,9153/TCP   8d

NAME                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/aws-node                 2         2         2       2            2           <none>                   8d
daemonset.apps/efs-csi-node             2         2         2       2            2           kubernetes.io/os=linux   6d
daemonset.apps/eks-pod-identity-agent   2         2         2       2            2           <none>                   8d
daemonset.apps/kube-proxy               2         2         2       2            2           <none>                   8d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/aws-load-balancer-controller   2/2     2            2           40m
deployment.apps/coredns                        2/2     2            2           8d
deployment.apps/efs-csi-controller             2/2     2            2           6d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/aws-load-balancer-controller-6bf4b948d6   2         2         2       40m
replicaset.apps/coredns-6556f9967c                        0         0         0       8d
replicaset.apps/coredns-6c857f58b4                        2         2         2       8d
replicaset.apps/efs-csi-controller-bb6f8464b              2         2         2       6d

 kubectl get all -n gpu-operator
NAME                                                                  READY   STATUS             RESTARTS   AGE
pod/gpu-feature-discovery-nh94s                                       0/1     Init:0/1           0          51m
pod/gpu-feature-discovery-v8fgf                                       0/1     Init:0/1           0          51m
pod/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd66t7t7j   1/1     Running            0          66m
pod/gpu-operator-1714659266-node-feature-discovery-master-5598gsztr   1/1     Running            0          66m
pod/gpu-operator-1714659266-node-feature-discovery-worker-229sp       1/1     Running            0          52m
pod/gpu-operator-1714659266-node-feature-discovery-worker-5z6kj       1/1     Running            0          52m
pod/gpu-operator-cc9db7497-l2s89                                      1/1     Running            0          66m
pod/nvidia-container-toolkit-daemonset-6dt46                          0/1     Init:0/1           0          51m
pod/nvidia-container-toolkit-daemonset-mx4w4                          0/1     Init:0/1           0          51m
pod/nvidia-dcgm-exporter-6nh2x                                        0/1     Init:0/1           0          51m
pod/nvidia-dcgm-exporter-96hww                                        0/1     Init:0/1           0          51m
pod/nvidia-device-plugin-daemonset-jg4d9                              0/1     Init:0/1           0          51m
pod/nvidia-device-plugin-daemonset-r524n                              0/1     Init:0/1           0          51m
pod/nvidia-driver-daemonset-rfj5c                                     0/1     ImagePullBackOff   0          52m
pod/nvidia-driver-daemonset-rgpgh                                     0/1     ImagePullBackOff   0          52m
pod/nvidia-operator-validator-4mkt9                                   0/1     Init:0/4           0          51m
pod/nvidia-operator-validator-9kj2s                                   0/1     Init:0/4           0          51m

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/gpu-operator           ClusterIP   172.20.82.101    <none>        8080/TCP   7d7h
service/nvidia-dcgm-exporter   ClusterIP   172.20.248.145   <none>        9400/TCP   7d7h

NAME                                                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
daemonset.apps/gpu-feature-discovery                                   2         2         0       2            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       7d7h
daemonset.apps/gpu-operator-1714659266-node-feature-discovery-worker   2         2         2       2            2           <none>                                                                 7d7h
daemonset.apps/nvidia-container-toolkit-daemonset                      2         2         0       2            0           nvidia.com/gpu.deploy.container-toolkit=true                           7d7h
daemonset.apps/nvidia-dcgm-exporter                                    2         2         0       2            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               7d7h
daemonset.apps/nvidia-device-plugin-daemonset                          2         2         0       2            0           nvidia.com/gpu.deploy.device-plugin=true                               7d7h
daemonset.apps/nvidia-device-plugin-mps-control-daemon                 0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   7d7h
daemonset.apps/nvidia-driver-daemonset                                 2         2         0       2            0           nvidia.com/gpu.deploy.driver=true                                      7d7h
daemonset.apps/nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 7d7h
daemonset.apps/nvidia-operator-validator                               2         2         0       2            0           nvidia.com/gpu.deploy.operator-validator=true                          7d7h

NAME                                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                            1/1     1            1           7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-gc       1/1     1            1           7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-master   1/1     1            1           7d7h

NAME                                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd6644       1         1         1       7d7h
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-master-559868b8df   1         1         1       7d7h
replicaset.apps/gpu-operator-cc9db7497                                             1         1         1       7d7h

 kubectl get all -n flask-api-app
NAME                                        READY   STATUS    RESTARTS   AGE
pod/flask-api-deployment-59c668dcf8-wzl6p   0/1     Pending   0          44m

NAME                            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/flask-api-app-service   NodePort   172.20.201.77   <none>        80:32235/TCP   44m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/flask-api-deployment   0/1     1            0           44m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/flask-api-deployment-59c668dcf8   1         1         0       44m
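
Note that the only pod in this namespace is Pending with 0/1 containers ready. Below is a small Python sketch for flagging such pods from `kubectl get all` text output. It assumes the default column layout shown above (NAME, READY, STATUS, ...), and `not_ready_pods` is a hypothetical helper name.

```python
from typing import List, Tuple


def not_ready_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) for pods that are not Running with all containers ready."""
    flagged = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blanks, header rows, and non-pod resources from `kubectl get all`.
        if len(fields) < 3 or not fields[0].startswith("pod/"):
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            flagged.append((name, status))
    return flagged


if __name__ == "__main__":
    sample = """\
NAME                                        READY   STATUS    RESTARTS   AGE
pod/flask-api-deployment-59c668dcf8-wzl6p   0/1     Pending   0          44m
"""
    print(not_ready_pods(sample))
```

For any pod this flags, `kubectl describe pod -n flask-api-app <pod-name>` shows the events recorded for it, which usually explain why it has not started.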

Environment

Amazon Linux 2 Ubuntu

AWS Load Balancer controller version: 2.7.2 (I believe)

This is the output of:

kubectl describe deployment -n kube-system aws-load-balancer-controller
Name:                   aws-load-balancer-controller
Namespace:              kube-system
CreationTimestamp:      Thu, 09 May 2024 17:00:58 -0400
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.7.2
                        helm.sh/chart=aws-load-balancer-controller-1.7.2
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 0 available | 2 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.7.2
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=EKS-Test-Cluster
      --ingress-class=alb
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         aws-load-balancer-tls
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-6bf4b948d6 (2/2 replicas created)
Events:          <none>

Kubernetes version: 1.29
Using EKS (yes/no), if so version? Yes, eks.6

kubernetes-sigs / aws-load-balancer-controller, issue #3690