kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

AWS application load balancer not registering targets for Kubernetes EKS node target group #3690

Closed: asluborski closed this issue 6 months ago

asluborski commented 6 months ago

Describe the bug

I have an EKS cluster with public and private access, in a VPC with public and private subnets. I've set up my ALB in the public subnets on port 80, internet-facing with target type ip, and installed the AWS Load Balancer Controller by following the example in the AWS docs and the 2048 deployment example. I am using GPU nodes and have also set up the Kubernetes GPU operator. I have a Deployment and a Service for a Flask REST API.

After getting everything set up, I expected the running EKS node instances to register in my target group, but the target group is empty, so there are no registered targets to route traffic to the pods.

Here is a screenshot of the ALB and the empty target group from the AWS console

[Screenshots: loadbalancer; Screenshot 2024-05-09 172514]

I'm struggling to find out why this is happening. I've been adjusting my ingress and deployment YAML files and thought it might be a selector/label issue, but that doesn't seem to be the case. My deployment runs a Flask API on port 5000, and the ingress routes the /health path to the Flask server's /health endpoint, which returns a response.
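For context on the expectation above: with `target-type: ip`, the targets are pod IPs rather than node instances, so a pod that never becomes Ready contributes nothing to the target group. The sketch below is a simplified model of that idea, not the controller's actual code. The function name `ip_targets` and the pod IP `10.0.1.23` are hypothetical; the dictionaries mimic the shape of `kubectl get pods -o json` items.

```python
from typing import Dict, List


def ip_targets(pods: List[Dict], selector: Dict[str, str], target_port: int) -> List[str]:
    """Return "ip:port" targets for pods that match the selector and are Ready.

    Simplified model (assumption): only Ready pods with an assigned podIP
    become targets. The real controller works from Service endpoints.
    """
    targets = []
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        # A pod matches only if every selector key/value appears in its labels.
        if any(labels.get(key) != value for key, value in selector.items()):
            continue
        status = pod.get("status", {})
        pod_ip = status.get("podIP")
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        # A Pending pod has no podIP and is not Ready, so it yields no target.
        if pod_ip and ready:
            targets.append(f"{pod_ip}:{target_port}")
    return targets


selector = {"app.kubernetes.io/name": "flask-app"}
pending_pod = {
    "metadata": {"labels": dict(selector)},
    "status": {"phase": "Pending", "conditions": []},
}
running_pod = {
    "metadata": {"labels": dict(selector)},
    "status": {
        "phase": "Running",
        "podIP": "10.0.1.23",  # hypothetical address for illustration
        "conditions": [{"type": "Ready", "status": "True"}],
    },
}

print(ip_targets([pending_pod], selector, 5000))               # []
print(ip_targets([pending_pod, running_pod], selector, 5000))  # ['10.0.1.23:5000']
```

Under this model, an empty target group is what you would see whenever the only matching pod is stuck in Pending, even if the selector and labels are correct.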

Deployment.yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: flask-api-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api-deployment
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: flask-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: flask-app
    spec:
      containers:
      - name: flask-app
        image: xxxxxxxxxxxxxxxxxxxxxxx
        imagePullPolicy: Always
        ports:
        - containerPort: 5000
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      restartPolicy: Always
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: efs-claim  
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
---
apiVersion: v1
kind: Service
metadata:
  name: flask-api-app-service
  namespace: flask-api-app
  labels:
    app.kubernetes.io/name: flask-app
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: flask-app
  ports:
    - name: http
      port: 80
      targetPort: 5000
      protocol: TCP

ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: flask-api-app
  name: flask-ingress-3
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/is-default-class: "true"
  labels:
    app.kubernetes.io/name: flask-app
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
        - path: /health 
          pathType: Prefix
          backend:
            service:
              name: flask-api-app-service
              port:
                number: 80

service-account.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: aws-load-balancer-controller
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

This is the Dockerfile that I built for the deployment:

# start by pulling the python image
FROM python:3.9

# copy the requirements file into the image
COPY ./requirements.txt /app/requirements.txt

# switch working directory
WORKDIR /app

# install the dependencies and packages in the requirements file
RUN pip install -r requirements.txt

# copy everything from the local directory into the image
COPY . /app

# Expose port 5000 for Gunicorn
EXPOSE 5000

# Configure the container to run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "main:app"]

I also ran the command kubectl describe targetgroupbindings -n flask-api-app and this was the result:

Name:         k8s-flaskapi-flaskapi-c99c751836
Namespace:    flask-api-app
Labels:       ingress.k8s.aws/stack-name=flask-ingress-3
              ingress.k8s.aws/stack-namespace=flask-api-app
Annotations:  <none>
API Version:  elbv2.k8s.aws/v1beta1
Kind:         TargetGroupBinding
Metadata:
  Creation Timestamp:  xxxxxxxxxxxxxxxx
  Finalizers:
    elbv2.k8s.aws/resources
  Generation:        1
  Resource Version:  1802318
  UID:               xxxxxxxxxxxxxxxxxxxxxxxxx
Spec:
  Ip Address Type:  ipv4
  Networking:
    Ingress:
      From:
        Security Group:
          Group ID:  xxxxxxxxxxxxxxxxxxxx
      Ports:
        Port:      5000
        Protocol:  TCP
  Service Ref:
    Name:            flask-api-app-service
    Port:            80
  Target Group ARN:  xxxxxxxxxxxxxxxxxxxxxxxx
  Target Type:       ip
Status:
  Observed Generation:  1
Events:
  Type    Reason                  Age                From                Message
  ----    ------                  ----               ----                -------
  Normal  SuccessfullyReconciled  10m (x2 over 10m)  targetGroupBinding  Successfully reconciled
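The binding above references Service port 80 but opens port 5000 in the networking rule. That is consistent with the Service's `port: 80` mapping to `targetPort: 5000`. As a rough sketch of that lookup (the helper name `resolve_target_port` is hypothetical, and this is not the controller's implementation), assuming the Service spec is available as a dictionary:

```python
from typing import Dict, Union

PortRef = Union[int, str]


def resolve_target_port(service_spec: Dict, service_port: PortRef) -> PortRef:
    """Map a Service port (number or name) to the port the pods listen on."""
    for entry in service_spec.get("ports", []):
        if entry.get("port") == service_port or entry.get("name") == service_port:
            # Kubernetes defaults targetPort to the value of port when omitted.
            return entry.get("targetPort", entry["port"])
    raise LookupError(f"service port {service_port!r} not found")


# Spec values copied from the Service definition in this issue.
flask_service_spec = {
    "type": "NodePort",
    "ports": [{"name": "http", "port": 80, "targetPort": 5000, "protocol": "TCP"}],
}

print(resolve_target_port(flask_service_spec, 80))      # 5000
print(resolve_target_port(flask_service_spec, "http"))  # 5000
```

So the port wiring in the TargetGroupBinding looks correct on its face, which suggests looking at pod readiness rather than the Service or Ingress definitions.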

namespaces:

 kubectl get namespaces
NAME              STATUS   AGE
default           Active   8d
flask-api-app     Active   40m
gpu-operator      Active   7d7h
kube-node-lease   Active   8d
kube-public       Active   8d
kube-system       Active   8d
 kubectl get all -n kube-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/aws-load-balancer-controller-6bf4b948d6-c2h9s   1/1     Running   0          40m
pod/aws-load-balancer-controller-6bf4b948d6-h4sqp   1/1     Running   0          40m
pod/aws-node-25wtp                                  2/2     Running   0          51m
pod/aws-node-mfgjn                                  2/2     Running   0          51m
pod/coredns-6c857f58b4-hhq74                        1/1     Running   0          50m
pod/coredns-6c857f58b4-mn2k2                        1/1     Running   0          65m
pod/efs-csi-controller-bb6f8464b-tjd4j              3/3     Running   0          65m
pod/efs-csi-controller-bb6f8464b-zzrjl              3/3     Running   0          65m
pod/efs-csi-node-6rj6n                              3/3     Running   0          51m
pod/efs-csi-node-kdfmh                              3/3     Running   0          51m
pod/eks-pod-identity-agent-pbk84                    1/1     Running   0          51m
pod/eks-pod-identity-agent-qnh8b                    1/1     Running   0          51m
pod/kube-proxy-d59bz                                1/1     Running   0          51m
pod/kube-proxy-n4vjr                                1/1     Running   0          51m

NAME                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
service/aws-load-balancer-webhook-service   ClusterIP   172.20.157.47   <none>        443/TCP                  40m
service/kube-dns                            ClusterIP   172.20.0.10     <none>        53/UDP,53/TCP,9153/TCP   8d

NAME                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/aws-node                 2         2         2       2            2           <none>                   8d
daemonset.apps/efs-csi-node             2         2         2       2            2           kubernetes.io/os=linux   6d
daemonset.apps/eks-pod-identity-agent   2         2         2       2            2           <none>                   8d
daemonset.apps/kube-proxy               2         2         2       2            2           <none>                   8d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/aws-load-balancer-controller   2/2     2            2           40m
deployment.apps/coredns                        2/2     2            2           8d
deployment.apps/efs-csi-controller             2/2     2            2           6d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/aws-load-balancer-controller-6bf4b948d6   2         2         2       40m
replicaset.apps/coredns-6556f9967c                        0         0         0       8d
replicaset.apps/coredns-6c857f58b4                        2         2         2       8d
replicaset.apps/efs-csi-controller-bb6f8464b              2         2         2       6d
 kubectl get all -n gpu-operator
NAME                                                                  READY   STATUS             RESTARTS   AGE
pod/gpu-feature-discovery-nh94s                                       0/1     Init:0/1           0          51m
pod/gpu-feature-discovery-v8fgf                                       0/1     Init:0/1           0          51m
pod/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd66t7t7j   1/1     Running            0          66m
pod/gpu-operator-1714659266-node-feature-discovery-master-5598gsztr   1/1     Running            0          66m
pod/gpu-operator-1714659266-node-feature-discovery-worker-229sp       1/1     Running            0          52m
pod/gpu-operator-1714659266-node-feature-discovery-worker-5z6kj       1/1     Running            0          52m
pod/gpu-operator-cc9db7497-l2s89                                      1/1     Running            0          66m
pod/nvidia-container-toolkit-daemonset-6dt46                          0/1     Init:0/1           0          51m
pod/nvidia-container-toolkit-daemonset-mx4w4                          0/1     Init:0/1           0          51m
pod/nvidia-dcgm-exporter-6nh2x                                        0/1     Init:0/1           0          51m
pod/nvidia-dcgm-exporter-96hww                                        0/1     Init:0/1           0          51m
pod/nvidia-device-plugin-daemonset-jg4d9                              0/1     Init:0/1           0          51m
pod/nvidia-device-plugin-daemonset-r524n                              0/1     Init:0/1           0          51m
pod/nvidia-driver-daemonset-rfj5c                                     0/1     ImagePullBackOff   0          52m
pod/nvidia-driver-daemonset-rgpgh                                     0/1     ImagePullBackOff   0          52m
pod/nvidia-operator-validator-4mkt9                                   0/1     Init:0/4           0          51m
pod/nvidia-operator-validator-9kj2s                                   0/1     Init:0/4           0          51m

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/gpu-operator           ClusterIP   172.20.82.101    <none>        8080/TCP   7d7h
service/nvidia-dcgm-exporter   ClusterIP   172.20.248.145   <none>        9400/TCP   7d7h

NAME                                                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
daemonset.apps/gpu-feature-discovery                                   2         2         0       2            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       7d7h
daemonset.apps/gpu-operator-1714659266-node-feature-discovery-worker   2         2         2       2            2           <none>                                                                 7d7h
daemonset.apps/nvidia-container-toolkit-daemonset                      2         2         0       2            0           nvidia.com/gpu.deploy.container-toolkit=true                           7d7h
daemonset.apps/nvidia-dcgm-exporter                                    2         2         0       2            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               7d7h
daemonset.apps/nvidia-device-plugin-daemonset                          2         2         0       2            0           nvidia.com/gpu.deploy.device-plugin=true                               7d7h
daemonset.apps/nvidia-device-plugin-mps-control-daemon                 0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   7d7h
daemonset.apps/nvidia-driver-daemonset                                 2         2         0       2            0           nvidia.com/gpu.deploy.driver=true                                      7d7h
daemonset.apps/nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 7d7h
daemonset.apps/nvidia-operator-validator                               2         2         0       2            0           nvidia.com/gpu.deploy.operator-validator=true                          7d7h

NAME                                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                            1/1     1            1           7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-gc       1/1     1            1           7d7h
deployment.apps/gpu-operator-1714659266-node-feature-discovery-master   1/1     1            1           7d7h

NAME                                                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-gc-67c4bd6644       1         1         1       7d7h
replicaset.apps/gpu-operator-1714659266-node-feature-discovery-master-559868b8df   1         1         1       7d7h
replicaset.apps/gpu-operator-cc9db7497                                             1         1         1       7d7h
 kubectl get all -n flask-api-app
NAME                                        READY   STATUS    RESTARTS   AGE
pod/flask-api-deployment-59c668dcf8-wzl6p   0/1     Pending   0          44m

NAME                            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/flask-api-app-service   NodePort   172.20.201.77   <none>        80:32235/TCP   44m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/flask-api-deployment   0/1     1            0           44m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/flask-api-deployment-59c668dcf8   1         1         0       44m

Environment: Amazon Linux 2 Ubuntu

This is the output of:

kubectl describe deployment -n kube-system aws-load-balancer-controller
Name:                   aws-load-balancer-controller
Namespace:              kube-system
CreationTimestamp:      Thu, 09 May 2024 17:00:58 -0400
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.7.2
                        helm.sh/chart=aws-load-balancer-controller-1.7.2
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 0 available | 2 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.7.2
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=EKS-Test-Cluster
      --ingress-class=alb
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         aws-load-balancer-tls
    Optional:           false
  Priority Class Name:  system-cluster-critical
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      False   MinimumReplicasUnavailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-6bf4b948d6 (2/2 replicas created)
Events:          <none>

aravindsagar commented 6 months ago

Hi! Thanks for reporting the issue. Would you be able to share the controller logs? This can help us get to the cause of the issue. Thanks!

asluborski commented 6 months ago

Hello, the issue was unrelated to the ALB. I was using GPU nodes with Amazon Linux 2 and tried to install a driver tagged for AL2 that does not exist. I have since moved the OS to Bottlerocket NVIDIA and everything is working.
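For anyone landing here with the same symptom, the `kubectl get all` listings earlier in the thread already showed the problem: the Flask pod was Pending and the NVIDIA driver pods were in ImagePullBackOff. A small sketch for spotting such pods in pasted `kubectl get pods` text follows. The function name `unhealthy_pods` is hypothetical, and the check is deliberately naive (for example, it would also flag Completed job pods).

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) for pods that are not Running with all containers ready."""
    flagged = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            flagged.append((name, status))
    return flagged


# Rows copied from the listings in this issue.
sample = """
NAME                                        READY   STATUS             RESTARTS   AGE
pod/gpu-operator-cc9db7497-l2s89            1/1     Running            0          66m
pod/nvidia-driver-daemonset-rfj5c           0/1     ImagePullBackOff   0          52m
pod/nvidia-operator-validator-4mkt9         0/1     Init:0/4           0          51m
pod/flask-api-deployment-59c668dcf8-wzl6p   0/1     Pending            0          44m
"""

for name, status in unhealthy_pods(sample):
    print(name, status)
```

Running `kubectl describe pod` on any pod this flags would then show the scheduling or image pull events behind the status.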