kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler [AWS] isn't aware of LoadBalancer inflight requests causing 502s when external traffic policy is set to Cluster #1907

Closed. nithu0115 closed this issue 2 years ago

nithu0115 commented 5 years ago

I have two deployments, each behind a Service of type LoadBalancer with externalTrafficPolicy set to Cluster: deployment A behind Load Balancer A, and deployment B behind Load Balancer B. I am also using cluster-autoscaler to scale my worker nodes. Deployment A is my app server; deployment B is my web server, which forwards all requests to Load Balancer A in front of deployment A's workloads. The RTT for each request is around 10-20 seconds. (To reproduce the issue, I wrote a sample app that includes a 20-second sleep.)

Whenever I add a new deployment workload (say C) to my cluster, cluster-autoscaler adds new nodes to satisfy the workload's requests. Whenever I delete deployment C, cluster-autoscaler scales the worker nodes back down (drain -> terminate).

Because my externalTrafficPolicy is set to Cluster, every new node that joins the cluster is also registered with the load balancers. However, when cluster-autoscaler deletes a node (say node 10), all requests flowing through node 10 are cut off: the node is marked for termination as soon as no workload is running on it, but cluster-autoscaler is not aware of the active/in-flight requests that node 10 is still proxying for Service A and Service B. Those requests are interrupted by the drain/termination, and clients see 502s.

Workaround: Change externalTrafficPolicy to Local.

Ask: Make cluster-autoscaler more resilient by making it aware of in-flight requests when externalTrafficPolicy is set to Cluster.
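
For reference, the workaround amounts to setting externalTrafficPolicy on the Service spec. A minimal sketch reusing Service A's names from the manifest below (with Local, the load balancer only routes to nodes running a ready backing pod, so draining a node that carries none of the Service's pods no longer breaks its in-flight requests, at the cost of cross-node traffic spreading):
===
apiVersion: v1
kind: Service
metadata:
  name: sameple-go-app
spec:
  type: LoadBalancer
  # Only nodes running a ready pod for this Service receive traffic from the
  # load balancer, so scaling down an unrelated node no longer drops requests.
  externalTrafficPolicy: Local
  selector:
    run: limits-nginx
  ports:
  - port: 80
    targetPort: 8080
===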

Deployment A manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limits-nginx
spec:
  selector:
    matchLabels:
      run: limits-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: limits-nginx
    spec:
      containers:
      - name: limits-nginx
        image: nithmu/nithish:sample-golang-app
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "264Mi"
            cpu: "250m"
          limits:
            memory: "300Mi"
            cpu: "300m"
===
Service A manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"sameple-go-app",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run": "limits-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"limits-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment B manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nithmu/nithish:nginx_echo
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
===
Service B manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"my-nginx",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run":"my-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"my-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment C manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: l-nginx
spec:
  selector:
    matchLabels:
      run: l-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: l-nginx
    spec:
      containers:
      - name: l-nginx
        image: nginx
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "1564Mi"
            cpu: "2000m"
          limits:
            memory: "1600Mi"
            cpu: "2500m"
===
azelezni commented 1 year ago

We have the same (or a similar) issue when using alb-ingress-controller with cluster-autoscaler: during scale-down we get 5XXs because CA doesn't wait for the target group to deregister the target. In our case we are using alb.ingress.kubernetes.io/target-type=ip.

I came up with this solution to make the pods "wait" until they are deregistered from the target group:

- name: wait-till-deregistered
  image: public.ecr.aws/bitnami/aws-cli:2.11.23
  command:
    - /bin/bash
    - -c
  args:
    - |
      CLUSTER_NAME="my-cluster";
      INGRESS_GROUP="${CLUSTER_NAME}-private";
      DEPLOYMENT_NAME="my-workload";
      SERVICE_NAME="my-workload";
      PORT_NAME="http";
      STACK_FILTER="Key=ingress.k8s.aws/stack,Values=${INGRESS_GROUP}";
      RESOURCE_FILTER="Key=ingress.k8s.aws/resource,Values=${CLUSTER_NAME}/${DEPLOYMENT_NAME}-${SERVICE_NAME}:${PORT_NAME}";
      # Look up the target group created by the AWS load balancer controller via its
      # ingress.k8s.aws/* tags (both filters passed to a single --tag-filters flag so they are ANDed).
      TG_ARN=$(aws resourcegroupstaggingapi get-resources --resource-type-filters elasticloadbalancing:targetgroup --tag-filters ${STACK_FILTER} ${RESOURCE_FILTER} --query "ResourceTagMappingList[*].ResourceARN | [0]" --output text);

      # Block until this pod's IP is no longer registered in the target group.
      echo "Waiting until target Id=${MY_POD_IP} is deregistered";
      until aws elbv2 wait target-deregistered --target-group-arn ${TG_ARN} --targets Id=${MY_POD_IP};
      do
          echo "Still waiting...";
          sleep 1;
      done;
      echo "target Id=${MY_POD_IP} has been deregistered";
  env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  securityContext:
    allowPrivilegeEscalation: false
    runAsUser: 0

This is a sidecar container in our frontend workloads; all it does is wait until the pod IP is deregistered, then exit. I find this approach a lot easier than using Lambda + ASG lifecycle hooks (no need for new Terraform code and/or CI/CD pipelines).

Keep in mind I only tested this on target-type=ip since we register our pods to target groups and not the instances, so YMMV.

infa-ddeore commented 9 months ago

> externalTrafficPolicy: Cluster

> hello @infa-ddeore, could you please share your code for deregister-node-from-lb-before-terminating=true? Thanks

@SCLogo, I can't share the customized code, but it's a small change: during cordon we add the node.kubernetes.io/exclude-from-external-load-balancers=true label to the node, and the deregister-node-from-lb-before-terminating argument is added to our build of the binary.
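
For context, node.kubernetes.io/exclude-from-external-load-balancers is a standard well-known Kubernetes label: the service controller skips labeled nodes when registering load balancer backends for Services of type LoadBalancer. A minimal sketch of a labeled node (the node name is illustrative, and the deregister-node-from-lb-before-terminating flag itself is a custom patch, not an upstream cluster-autoscaler option):
===
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal   # illustrative node name
  labels:
    # Nodes carrying this well-known label are excluded from external
    # load balancer target registration by the service controller.
    node.kubernetes.io/exclude-from-external-load-balancers: "true"
===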

ibalat commented 6 months ago

> (quoting azelezni's wait-till-deregistered sidecar workaround above)

Nice workaround. I face the same issue and added a 10-second preStop hook, but it didn't solve it. I was thinking of increasing it to 20s before I saw your solution. Is this effectively the same as just adding a longer preStop hook?
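
For comparison, the fixed-sleep alternative mentioned above would look roughly like this on the workload container (a sketch; the container name, image, and 20-second value are illustrative). A preStop sleep only delays shutdown for a fixed period, whereas the wait-till-deregistered sidecar blocks until the target is actually gone from the target group, so the two are not equivalent.
===
containers:
- name: my-workload            # illustrative container name
  image: nginx                 # illustrative image
  lifecycle:
    preStop:
      exec:
        # Fixed delay before SIGTERM is sent to the main process; it does not
        # confirm that the pod has been deregistered from the target group.
        command: ["/bin/sh", "-c", "sleep 20"]
===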