kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler [AWS] isn't aware of LoadBalancer inflight requests causing 502s when external traffic policy is set to Cluster #1907

Closed nithu0115 closed 2 years ago

nithu0115 commented 5 years ago

I have two deployments, each behind a Service of type LoadBalancer with External Traffic Policy set to Cluster: deployment A behind Load Balancer A and deployment B behind Load Balancer B. I am also using cluster-autoscaler to scale my worker nodes. Deployment A is my app server; deployment B is my web server, which forwards all requests to Load Balancer A in front of deployment A's workloads. The RTT for each request is around 10-20 seconds. (To reproduce the issue, I wrote a sample app that includes a 20-second sleep.)

Whenever I add a new deployment workload (say C) to my cluster, cluster-autoscaler adds new nodes to fulfill the workload request. Whenever I delete the deployment workload (C), cluster-autoscaler scales the worker nodes back down (drain -> terminate).

As my external traffic policy is set to Cluster, all new nodes that join the cluster are also registered with the load balancers. However, when cluster-autoscaler deletes a node (say Node 10), all requests in flight on Node 10 are closed, because the node is marked for termination by cluster-autoscaler once no workload is running on it. cluster-autoscaler is not aware of the active/in-flight requests on Node 10 for Service A and Service B, so those requests are interrupted by the node draining/termination and surface as 502s for Service A.

Workaround: change the external traffic policy to Local.

Ask: make cluster-autoscaler more resilient by making it aware of in-flight requests when the external traffic policy is set to Cluster.

Deployment A manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limits-nginx
spec:
  selector:
    matchLabels:
      run: limits-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: limits-nginx
    spec:
      containers:
      - name: limits-nginx
        image: nithmu/nithish:sample-golang-app
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "264Mi"
            cpu: "250m"
          limits:
            memory: "300Mi"
            cpu: "300m"
===
Service A manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"sameple-go-app",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run": "limits-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"limits-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment B manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nithmu/nithish:nginx_echo
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
===
Service B manifest file
===
{
   "kind":"Service",
   "apiVersion":"v1",
   "metadata":{
      "name":"my-nginx",
      "annotations":{
        "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
        "service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled": "true"
      },
      "labels":{
         "run":"my-nginx"
      }
   },
   "spec":{
      "ports": [
         {
           "port":80,
           "targetPort":8080
         }
      ],
      "selector":{
         "run":"my-nginx"
      },
      "type":"LoadBalancer"
   }
}
===
Deployment C manifest file 
===
apiVersion: apps/v1
kind: Deployment
metadata:
  name: l-nginx
spec:
  selector:
    matchLabels:
      run: l-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: l-nginx
    spec:
      containers:
      - name: l-nginx
        image: nginx
        env:
        - name: MSG_ENV
          value: "Hello from the environment"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "1564Mi"
            cpu: "2000m"
          limits:
            memory: "1600Mi"
            cpu: "2500m"
===
Jeffwan commented 5 years ago

/assign

Jeffwan commented 5 years ago

I will try to add a grace period between node drain and node delete.

Can we also try using a preStop hook to delay a few seconds, to give the EndpointController enough time to remove the pod from the Service endpoints? That way, even after the eviction request is sent, the pod can still serve traffic until it is removed from the endpoint list.

lifecycle:
  preStop:
    exec:
      command: [
        "sh", "-c",
        # Introduce a delay to the shutdown sequence to wait for the
        # pod eviction event to propagate. Then, gracefully shutdown nginx.
        "sleep 5 && /usr/sbin/nginx -s quit",
      ]
krzysztof-jastrzebski commented 5 years ago

We experienced a similar issue with Ingress and we are working with the networking team to solve it. The solution is to use annotations and taints so the Ingress controller can drain connections before Cluster Autoscaler removes the node. Solution:

  1. The Ingress controller adds/removes a "delay-deletion.cluster-autoscaler.kubernetes.io/ingress" annotation on all nodes used by the Ingress when a node is added to/removed from the ingress. We add it to all nodes because the annotation needs to be added before Cluster Autoscaler adds its taint, so there is no race condition.
  2. When a node is scaled down, Cluster Autoscaler adds a taint (ToBeDeletedByClusterAutoscaler) to the node, marks it as a deletion candidate, drains the node and waits for some time (see point 5).
  3. The Ingress controller watches for the taint (ToBeDeletedByClusterAutoscaler) and removes the node from the ingress when the node has the taint.
  4. The Ingress controller removes the "delay-deletion.cluster-autoscaler.kubernetes.io/ingress" annotation from the node after the node is removed from the ingress.
  5. Cluster Autoscaler removes the node after all annotations with the prefix "delay-deletion.cluster-autoscaler.kubernetes.io/" are removed, or if an annotation is not removed within 5 minutes (timeout defined in a flag). The above solution can be used with other controllers; it is enough to use a different annotation with the prefix "delay-deletion.cluster-autoscaler.kubernetes.io/".
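
Purely as an illustration of the handshake (a real ingress controller would act through the API; the node name and taint value are placeholders):

    # Ingress controller: mark every node it load-balances to, before CA ever taints it.
    kubectl annotate node ip-10-0-1-23.ec2.internal \
      delay-deletion.cluster-autoscaler.kubernetes.io/ingress=true

    # Cluster Autoscaler: picks the node for scale-down and applies its taint.
    kubectl taint node ip-10-0-1-23.ec2.internal \
      ToBeDeletedByClusterAutoscaler=1658345678:NoSchedule

    # Ingress controller: sees the taint, drains the node from the LB, then releases CA
    # by removing its annotation.
    kubectl annotate node ip-10-0-1-23.ec2.internal \
      delay-deletion.cluster-autoscaler.kubernetes.io/ingress-

    # CA deletes the node once no delay-deletion.* annotations remain (or after the timeout).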
Jeffwan commented 5 years ago

We experienced a similar issue with Ingress and we are working with the networking team to solve it. The solution is to use annotations and taints so the Ingress controller can drain connections before Cluster Autoscaler removes the node. Solution:

  1. The Ingress controller adds/removes a "delay-deletion.cluster-autoscaler.kubernetes.io/ingress" annotation on all nodes used by the Ingress when a node is added to/removed from the ingress. We add it to all nodes because the annotation needs to be added before Cluster Autoscaler adds its taint, so there is no race condition.
  2. When a node is scaled down, Cluster Autoscaler adds a taint (ToBeDeletedByClusterAutoscaler) to the node, marks it as a deletion candidate, drains the node and waits for some time (see point 5).
  3. The Ingress controller watches for the taint (ToBeDeletedByClusterAutoscaler) and removes the node from the ingress when the node has the taint.
  4. The Ingress controller removes the "delay-deletion.cluster-autoscaler.kubernetes.io/ingress" annotation from the node after the node is removed from the ingress.
  5. Cluster Autoscaler removes the node after all annotations with the prefix "delay-deletion.cluster-autoscaler.kubernetes.io/" are removed, or if an annotation is not removed within 5 minutes (timeout defined in a flag). The above solution can be used with other controllers; it is enough to use a different annotation with the prefix "delay-deletion.cluster-autoscaler.kubernetes.io/".

That's great! Which ingress controller are you using? Sounds like a customized one. My only concern is that the solution couples tightly with cluster-autoscaler. The current solution we want to move forward with is adding a grace period before node deletion (covering both the empty-node and drained-node cases). This is not perfect, because CA doesn't know how long to wait since it's not aware of user requests, but I think something like 5s would resolve most of the issues.

At the same time, I am looking into native ASG support to see whether we can leverage some feature when CA terminates nodes in an ASG; that would be perfect. So far, I have not figured it out.

MaciekPytel commented 5 years ago

The current solution we want to move forward with is adding a grace period before node deletion

A grace period only helps if the ingress controller is already removing the node from the LB. How would it know to start doing so? I don't think it reacts to just draining pods (and why would it?). It seems like there needs to be some custom logic on the ingress side to start draining connections from a node once it notices that CA wants to remove that node (point 2 in @krzysztof-jastrzebski's idea).

Jeffwan commented 5 years ago

@MaciekPytel kubectl drain sets node.Spec.Unschedulable to true. The service controller periodically fetches node information and reconciles the load balancers.

https://github.com/kubernetes/kubernetes/blob/32b37f5a85e8f2987f8367e611147cd4d4dfa4f9/pkg/controller/service/service_controller.go#L588-L594 (I am confused why it's still using a periodic fetch rather than a watch... there's a 100s nodeSyncPeriod.)

The logic here makes sure unschedulable nodes are removed from the load balancer. This is common logic, and I think the idea is not to couple the ingress/service controller logic with CA or a normal drain.

To make it generic, I think CA should also set this field in its custom drain logic, maybe like this: https://github.com/Jeffwan/autoscaler/commit/0c02d5bed0d8555187a2b1b289e1044c6f9e2b5c#diff-5f36cd12baa8998cd7f2ba7c4a00cbc6
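
A quick way to see this mechanism (node name is a placeholder):

    # Cordon sets node.Spec.Unschedulable = true ...
    kubectl cordon ip-10-0-1-23.ec2.internal
    kubectl get node ip-10-0-1-23.ec2.internal -o jsonpath='{.spec.unschedulable}'   # -> true
    # ... and on its next periodic node sync the service controller drops the
    # unschedulable node from every LoadBalancer-type Service's backend list.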

krzysztof-jastrzebski commented 5 years ago

We use the taint (ToBeDeletedByClusterAutoscaler) instead of unschedulable because sometimes we are not able to finish the drain. There might be a PodDisruptionBudget which prevents the drain from succeeding within 10 minutes; in such a case we only remove the taint. Unschedulable is used by other things, so we wouldn't know whether we could change unschedulable back to false after a failed drain, as it could have been set to true by the user or by other mechanisms before or during the drain. Adding a sleep would only help if the service controller/ingress watched the taint. Such a solution is similar to the one we proposed, but it is less reliable and also needs changes in the service controller/ingress controller.

Jeffwan commented 5 years ago

We use the taint (ToBeDeletedByClusterAutoscaler) instead of unschedulable because sometimes we are not able to finish the drain. There might be a PodDisruptionBudget which prevents the drain from succeeding within 10 minutes; in such a case we only remove the taint. Unschedulable is used by other things, so we wouldn't know whether we could change unschedulable back to false after a failed drain, as it could have been set to true by the user or by other mechanisms before or during the drain. Adding a sleep would only help if the service controller/ingress watched the taint. Such a solution is similar to the one we proposed, but it is less reliable and also needs changes in the service controller/ingress controller.

That's true, I agree. Do you know whether GKE has the same problem for the core v1 Service? It looks like this is a common issue.

krzysztof-jastrzebski commented 5 years ago

AFAIK the GKE L4 load balancer doesn't have this problem, but I don't know how it works. You can probably ask sig-network.

markine commented 5 years ago

@krzysztof-jastrzebski So I understand that the CA uses taints with the NoSchedule effect when deleting nodes instead of setting node.Spec.Unschedulable=true. This is done because, in the case of a non-empty node, we are unsure whether all evictions will succeed before we delete the node via the cloud provider: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scale_down.go#L990

Now consider that the k8s service controller computes the list of load balancer targets based on the node.Spec.Unschedulable flag and not based on node taints: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/service/service_controller.go#L554

There are thus two points to consider in the interaction of CA and the service controller during node deletes:

a) It is good that the CA uses taints before evicting pods: we don't want to kick the node out of ELB target lists before we know that we're going to delete the node for sure.

b) We have a race because we never set node.Spec.Unschedulable prior to deleting the node via the cloud provider: ELBs may still round-robin traffic to a deleted node before k8s realizes the node is gone and the service controller updates all ELBs.

I think this issue exists independent of whether we set External Traffic Policy to Local or Cluster. Setting the policy to Local just makes the race less probable: node deletion only affects services whose endpoint pods were on the dying node up until a few seconds ago, whereas with the policy set to Cluster, deleting any node affects all services.

To conclude, can the CA set node.Spec.Unschedulable and sleep right before deleting a node via the cloud provider? Alternatively perhaps we could use the ServiceNodeExclusion feature and label the dying node with "alpha.service-controller.kubernetes.io/exclude-balancer" and sleep before destroying the node?
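
A minimal sketch of the two alternatives (node name is a placeholder; the sleep would have to cover the service controller's sync period plus the LB drain time):

    NODE=ip-10-0-1-23.ec2.internal

    # Option A: mark the node unschedulable so the service controller pulls it out of all LBs.
    kubectl cordon "$NODE"

    # Option B: use the ServiceNodeExclusion feature gate's label instead.
    kubectl label node "$NODE" alpha.service-controller.kubernetes.io/exclude-balancer=true

    # Either way, wait for the LBs to deregister/drain the node before the cloud-provider delete.
    sleep 120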

MaciekPytel commented 5 years ago

It is good that the CA uses taints before evicting pods: we don't want to kick the node out of ELB target lists before we know that we're going to delete the node for sure.

Is that so? If we are trying to delete a node wouldn't that be a good enough reason to remove it from LB? Perhaps we should remove it so that removing from LB happens at the same time as draining pods. The caveat is if we failed to drain, we may need to re-add the node to LB (I don't know how to do that though).

Alternatively perhaps we could use the ServiceNodeExclusion feature and label the dying node with "alpha.service-controller.kubernetes.io/exclude-balancer"

That's an interesting idea. What happens if the annotation is removed? Is the node added back?

sleep before destroying the node?

That part is not ideal. First of all, how long would CA sleep? Of course it would be flag-controlled, but what's the default, or even the recommendation? I imagine it will be very environment-dependent. Even more importantly - CA already scales down non-empty nodes very slowly. Draining is serial and takes at least a minute per node. Imagine you have a 300-node spike - it will take many hours to scale back down. I really don't like the idea of introducing additional sleeps that will make this problem even worse, which brings me back to: can we start removing the node from the LB when we start draining, not when we finish it? That would pretty much imply we need to change the service controller so it can trigger on a different signal than spec.Unschedulable. We could make it watch for the CA taint (which I think is what @krzysztof-jastrzebski proposed) or we can try to use some other signal.

markine commented 5 years ago

It is good that the CA uses taints before evicting pods: we don't want to kick the node out of ELB target lists before we know that we're going to delete the node for sure.

Is that so? If we are trying to delete a node wouldn't that be a good enough reason to remove it from LB? Perhaps we should remove it so that removing from LB happens at the same time as draining pods. The caveat is if we failed to drain, we may need to re-add the node to LB (I don't know how to do that though).

The way I understand things, removing a node from the load balancers takes down the service endpoints that live on that node. Removing a node from a load balancer prior to a drain therefore has the potential to subtly violate a PodDisruptionBudget. Example:

A node has PodA, PodB, PodC. PodC is a sole endpoint of Service1. PodC has an associated PodDisruptionBudget set up disallowing eviction. The node is removed from the load balancer before drain starts. PodC endpoint is deemed unavailable in Service1 very quickly (if not by the service itself then by the fact that the LB has been updated), even though PodC is still alive. The drain fails at PodC due to the PDB (or earlier). PodC stays alive but Service1 stays down until we program the LBs to include the node again.
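
For concreteness, a hypothetical PDB like the one below (names invented for the example) is enough to make the eviction of PodC fail while the node has already been pulled out of the load balancer:

    # PodC is the sole pod behind Service1, so requiring one available replica means
    # the eviction API refuses to evict it and the drain fails.
    kubectl create poddisruptionbudget pod-c-pdb --selector=app=pod-c --min-available=1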

I understand that we don't want to stretch out draining a 300-node spike. Sleeping to make sure the LBs are updated and stop forwarding traffic to dying nodes seems necessary, even if that means modifying the code to run drains in parallel.

krzysztof-jastrzebski commented 5 years ago

Good point! The good thing is that this will only be a problem when using ServiceExternalTrafficPolicyTypeLocal. If we use ServiceExternalTrafficPolicyTypeCluster, then traffic will go through another node to our pod. We are working on this with the networking team. We probably won't fix it in 1.15, but we will try to address it in 1.16.

raravena80 commented 5 years ago

Any ETA on the fix for this issue? We are seeing the same. Thx.

patrat commented 5 years ago

We've run into this while using cluster-autoscaler together with alb-ingress-controller. The workaround we've identified is to create a Lambda function triggered on the AWS EC2 Instance-terminate Lifecycle Action:

  1. It sets the alpha.service-controller.kubernetes.io/exclude-balancer: "true" label on the node, so that the alb-ingress-controller doesn't add the node back.
  2. It deregisters the node from the ALB target group (thereby initiating ALB connection draining).

This is our implementation https://github.com/syscollective/kube_alb_lambda_deregister
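
Not the Lambda code itself (that's in the repo above), but roughly the same two steps expressed as CLI calls; the instance ID, node name and target group ARN are placeholders:

    INSTANCE_ID=i-0123456789abcdef0
    NODE=ip-10-0-1-23.ec2.internal
    TG_ARN=arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-tg/abc123

    # 1) keep the ingress controller from re-adding the node
    kubectl label node "$NODE" alpha.service-controller.kubernetes.io/exclude-balancer=true

    # 2) start ALB connection draining for the instance
    aws elbv2 deregister-targets --target-group-arn "$TG_ARN" --targets Id="$INSTANCE_ID"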

devinburnette commented 4 years ago

Hi, just ran into this ourselves. Surfacing as 500s in our applications. Would be great to have some clarity around what the autoscaling team is thinking around potential solutions for this or confirmation that this is indeed a bug in how the CA scales down nodes. Let me know if I can be a resource in any way.

jorihardman commented 4 years ago

@Jeffwan - Would you mind weighing in from the AWS side on this issue please? Based off this thread and my own experiences, cluster-autoscaler is currently incompatible with the LoadBalancer service. Our EKS nodes are constantly scaling up and down to accommodate backend queue load, and this issue causes constant disruption to requests to application server pods running on the cluster. Whenever cluster-autoscaler scales down a node, we see a handful of 504s:

[screenshot: graph of ELB 504 responses over the last 3 days]

This is over the last 3 days, and every single one aligns with an ASG scale down triggered by cluster-autoscaler. It's a major issue for us right now, and it's not clear how we can achieve scalability of the cluster along with stable external load balancing.

Jeffwan commented 4 years ago

Let me double-check this issue after KubeCon week. Due to the implementation of the service controller and the draining logic of CA, I am not sure there is a simple solution. Thanks for continuing to raise this issue.

bshelton229 commented 4 years ago

We've been able to solve this by making some minor adjustments to our existing node-draining daemonset/controller rig, as well as making sure the ELBs created by our services have connection draining enabled.

All of our AWS autoscaling groups (ASGs) have termination lifecycle hooks. There are lots of different rigs out there (and the new EKS node groups look to have this built in), but basically when a node receives the termination API call our daemonset kicks in, runs kubectl drain ... on the node, waits for it to drain, and then completes the ASG lifecycle hook, which tells the ASG it's safe to actually terminate the node. The trick was not completing the lifecycle hook until connections were drained from the various ELBs, not just waiting for the pods to be evicted. The node is still part of production routing until any in-flight requests are finished using it, regardless of what pods were running on it.

The two adjustments we made to eliminate our errors were:

  1. We made sure all our services had the service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout annotation, which we had missed on the problem ELBs.
  2. In our daemonset responsible for the kubectl drain we made sure that at least the time of our longest elb drain timeout had elapsed since the kubectl drain command was FIRST called before completing the lifecycle hook. So, if you want to drain connections for 60 seconds in your ELB, you'll want to make sure at the very least, 60 seconds have gone by since you first call kubectl drain, before telling the ASG it can terminate the instance. As far as I can tell, as soon as kubectl drain is called, it marks the node in such a way as to tell the control plane to start deregistering it from all ELBs. We may possibly add polling to make sure the instance is gone from all elbs and target groups in the future, rather than relying on a coarse sleep.

The nice part about the ASG lifecycle hook and daemonset is it will be called after cluster autoscaler declares the node should be gotten rid of. The kubectl drain ... should run relatively fast as cluster autoscaler should have started evictions, hence why we had to add the additional sleep (every problem in k8s is solved by adding a sleep, just like the '90s) between the drain call finishing and the lifecycle hook being completed. We need the node to stay around while requests are still potentially being allowed to drain.
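
A stripped-down sketch of that ordering (all names, the 60s drain timeout, and the hook parameters are placeholders; the real rig is a daemonset reacting to the ASG termination lifecycle event):

    NODE=ip-10-0-1-23.ec2.internal
    DRAIN_STARTED=$(date +%s)

    # Marks the node unschedulable, which also starts its deregistration from every ELB.
    kubectl drain "$NODE" --ignore-daemonsets --delete-local-data --force

    # Make sure at least the longest ELB drain timeout has passed since drain was FIRST called.
    ELAPSED=$(( $(date +%s) - DRAIN_STARTED ))
    [ "$ELAPSED" -lt 60 ] && sleep $(( 60 - ELAPSED ))

    # Only now tell the ASG it may actually terminate the instance.
    aws autoscaling complete-lifecycle-action \
      --lifecycle-hook-name node-drain-hook \
      --auto-scaling-group-name my-workers-asg \
      --lifecycle-action-result CONTINUE \
      --instance-id i-0123456789abcdef0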

We're initially excited about the new EKS node groups, but I need to look to see if they have enough flexibility to add this kind of logic to the drain mechanism AWS put in them. We haven't gotten to play with them yet. I don't see why we couldn't cleanup and open source our take on the draining rig, I got most of it from the kube-aws project, and modified it slightly.

This issue may best be solved in a manner like this, as CA calls an API in AWS that is actually event driven. We just put this fix in place, and so far it's gotten rid of 100% of the errors, but it's only been one day. I'll report back if they crop up again. Before this fix, we saw errors every single scaling event.

Semi-Related Realizations

After fixing this, it dawned on us how odd this routing strategy really is. We knew it worked this way, but this drove it home. Our ingress pods fronted by the ELB throwing these errors were already running on dedicated nodes in an ASG that doesn't have CA enabled at all, as we're more careful with our ingress pods given how much traffic goes through them. This really drove home that, even with that configuration, our highly volatile node group doing ETL data processing was actually serving production web traffic. Sounds crazy when you write it out.

We initially got really excited about externalTrafficPolicy Local, as it would only have nodes that contained the pods passing health checks in the ELB. However, at least in our testing we realized the Local policy has huge issues when updating deployments. As soon as a pod starts even a safe termination procedure (using preStop scripts with connection draining in the pod) traffic routing to the safely stopping pods stops working instantly, but it will take the ELB seconds to fail the health check that will take the instance with a safely draining pod out of rotation. This means that if a deployment update brings up a new pod on a new node, draining the old pod on a different node can't be made safe; at least as far as I could tell.

So, back to serving traffic from every single node in our cluster using the nginx ingress with this fix in place. We're looking deeply at the ALB ingress at this point, using the IP routing policy https://kubernetes-sigs.github.io/aws-alb-ingress-controller/guide/ingress/annotation/#traffic-routing. That seems fairly promising in testing so far.

jorihardman commented 4 years ago

@bshelton229 Thanks for the update. I've actually replaced our CloudFormation nodegroups with the new EKS managed nodegroups and still experience the same issue - this is with a 30s drain enabled. I have a feeling there's some sort of race condition between deregistering the node from the load balancer and the drain removing kube-proxy. I haven't had time to investigate it, but suffice it to say that a drain on the terminate event doesn't seem to solve the issue in my experience.

Similar to you, I wanted to give aws-alb-ingress-controller a shot with IP routing and unfortunately I experienced another similar race condition: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064. Obviously still investigating that, but since it was not able to offer me smooth deployments, I've reverted back to the LoadBalancer service. Better the devil you know than the one you don't I suppose.

If you get a stable setup with aws-alb-ingress-controller, I'd love to hear about it! Based on the documentation, it does seem to be a solution to this problem - maybe I just configured something wrong.

mollerdaniel commented 4 years ago

@jorihardman I'm running stable critical production workloads with no 502s or 504s during scale-in, using aws-alb-ingress-controller.

My plan is to deprecate all Services of type LoadBalancer and replace them all with aws-alb-ingress-controller.

My setup uses aws-alb-ingress-controller -> haproxy-ingress, but instead of haproxy-ingress it could be any app.

All environments have an HTTP timeout; in my case it's 300s. Let's call it HTTPTIMEOUT.

I use the following settings:

In my case, the confirmation that it works is in the logs, where CA waits for and respects the shutdown of your pods before killing the nodes:

E1203 22:23:05.329149 1 scale_down.go:1041] Not deleted yet &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:haproxy-ingress-77757786cc-x4lsm,GenerateName:haproxy-ingress-77757786cc-.........etc

bshelton229 commented 4 years ago

Node Draining Solution

@jorihardman I can now confirm that the fix I put in place in our node draining has solved this particular problem for us. This chart (numbers not to scale) shows the k8s node count in the autoscaling pool against ELB 5xx errors for our nginx ingress serviceType LoadBalancer. We haven't seen any direct ELB 5xx since our draining fix.

[screenshot: node count in the autoscaling pool vs. ELB 5xx errors]

Node draining before completing the ASG lifecycle hook won't solve this issue in itself. The key is to wait long enough after marking the node unschedulable for the node to be removed and drained from all ELBs it was automatically added to by k8s before you complete the lifecycle hook killing the node. I haven't found this ability in any generic node draining rigs yet. I'm not surprised it doesn't work with the default drainer in the new EKS node groups. I'll hopefully have a chance to spin one of those up in the near future, and will hopefully be able to inspect what is likely a lambda rig, and see if it can be modified to do what we put in place in our drain rig.

Most drain techniques won't remove daemonset pods, so kube-proxy should continue to run until the instance itself is gone, which in my testing continues to allow it to side-route traffic to pods. There are myriad node draining rigs out there, and it appears "EKS Node Groups" come stock with yet another one. We can use this one, github.com/aws-samples/amazon-k8s-node-drainer to illustrate what we found to be the problem.

https://github.com/aws-samples/amazon-k8s-node-drainer/blob/master/drainer/handler.py#L149

        cordon_node(v1, node_name)

        remove_all_pods(v1, node_name)

        asg.complete_lifecycle_action(LifecycleHookName=lifecycle_hook_name,
                                      AutoScalingGroupName=auto_scaling_group_name,
                                      LifecycleActionResult='CONTINUE',
                                      InstanceId=instance_id)

That is the basic runbook in the handler for the drain. A procedure like this fires once CA has decided the node should be removed, so CA has already evicted most of the pods itself; asg.complete_lifecycle_action will actually destroy the instance. This is a very common and sane thing to do, and is similar to what we had in place, but it will still drop connections during scale-down with serviceType LoadBalancer. That's because CA evicts most/all pods (but not kube-proxy) and eventually terminates the node by calling the correct hook, which is what lets draining rigs kick into gear. Taking the above code, the draining would look like:

  1. cordon_node is called, which marks the node unschedulable (@Jeffwan 's PR will do this earlier, as part of the CA process, which may almost completely solve this issue if that merges). When the node is marked unschedulable k8s will START to remove it from all ELBs it was added to, honoring any drain settings. Remember this node may be actively holding connections EVEN THOUGH it may have nothing but kube-proxy running, because of how servicetype loadbalancer works. Cordon is non-blocking and will return virtually immediately.
  2. remove_all_pods is called, which should evict all pods that aren't daemonsets. Again, this should leave kube-proxy running and still allow the node to side-route traffic to pods. This will likely run very quickly, or immediately, because CA has likely already evicted the pods before this chain of events starts.
  3. asg.complete_lifecycle_action is called telling AWS it can actually destroy the node itself, which will stop kube-proxy (obviously) breaking any connections still routing through kube-proxy.

The issue is it's probably not safe to actually stop the node just because all pods have been evicted. cordon_node is non-blocking, and only signals that k8s should start the process of removing the nodes from the elbs, but doesn't wait (and shouldn't) until the nodes are actually removed from the ELBs. In our case, we have a 300s elb drain configured, so we should wait at least 300 seconds after cordon_node before terminating the node with asg.complete_lifecycle_action. Our solution was to add logic between remove_all_pods and asg.complete_lifecycle_action. Our logic right now is to make sure we've slept at least as long as our longest ELB drain after calling cordon and before calling asg.complete_lifecycle_action. We plan to add an actual check to make sure k8s has removed the instance from all ELBs on its own before subsequently calling the lifecycle hook, rather than relying on an arbitrary sleep. A nearly arbitrary sleep is, however, the kubernetes way. Most of these drain procedures aren't dealing with the fact that the node is possibly still handling production traffic when all pods, save for kube-proxy and daemonsets, are gone.
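
The polling variant mentioned above could look roughly like this for classic ELBs (the instance ID is a placeholder; ALB/NLB target groups would need a similar loop with aws elbv2 describe-target-health):

    INSTANCE_ID=i-0123456789abcdef0
    for ELB in $(aws elb describe-load-balancers \
                   --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text); do
      # Wait until the instance is no longer registered in this ELB.
      while aws elb describe-instance-health --load-balancer-name "$ELB" \
              --query 'InstanceStates[].InstanceId' --output text | grep -q "$INSTANCE_ID"; do
        sleep 10
      done
    done
    # ...then complete the ASG lifecycle hook as before.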

I think @Jeffwan's fix of having CA do the cordon/unschedulable will likely solve almost all of this as well. Our drain solution has proved fairly foolproof for us thus far. We just had to orient ourselves to the fact that these nodes were possibly still serving nginx ingress traffic even though they had no more pods running on them and were never able to have nginx ingress pods running on them due to taints and tolerations.

Every time I write this stuff out I realize we need to move off of servicetype loadbalancer as soon as possible; and we're getting close :).

ALB Ingress

@mollerdaniel gave some great guidance on successfully using the ALB ingress with zero downtime deploys. We've also had good success with it. There are definitely tricks to it, including the deregistration delay stuff and preStop termination hooks in pods; as well as readiness in the pod not being the same as readiness in the target group. The trick we saw with the ALB ingress is the pod readiness isn't actually integrated with the ALB controller in IP mode. Meaning, as soon as the pod becomes "ready" as far as k8s is concerned, only then is the pod IP initially added to the target group. Now, this doesn't mean it's actually ready for traffic in the ALB target group yet, even though k8s has put it in the endpoints API and then proceeds to terminate other pods the new pods are replacing. This is because once the pod IP is added to the target group, it then has to pass the ALB health checks to become "ready" there. These two readiness checks aren't aware of each other at all. We had to get our sleeps and timeout in our preStop all synced up with the drains as has been mentioned.

We are also looking at fully replacing all ELBs with the ALB ingress controller. I think @mollerdaniel also mentioned this, but we're going to start by fronting our nginx ingress service with the ALB ingress controller, which sounds weird, but should work. The nginx ingress controller creates pods that serve the actual traffic and will balance the upstream pods by adding the IPs directly to the nginx upstream blocks via a lua plugin. You normally front the nginx ingress pods themselves with servicetype LoadBalancer. We're going to front our nginx ingress pods with an ingress document configured for the alb ingress controller, basically putting an ALB in front of our nginx ingress pods that only contains the nginx pod IPs. The ingress pods themselves don't re-deploy very often, while the services we have behind nginx ingress are literally continuously deploying during the day. We deploy about every 15 minutes.

jorihardman commented 4 years ago

@bshelton229 really appreciate the thorough write-up. Glad I'm not alone in all this trial-and-error. It looks like I have tentatively worked around this issue by switching to alb-ingress-controller. I don't want to crowd this thread with details on a completely different solution, but you can find my findings here: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064.

fradee commented 4 years ago

Cordon nodes -> #2868

There is PR https://github.com/kubernetes/autoscaler/pull/3014 as a possible solution for the 504 errors. It needs code review.

nemo83 commented 4 years ago

I'm here for the same issue. Any workaround? Thanks

vukomir commented 4 years ago

I'm also facing the same issue.

I managed to reduce the 5xx errors, but in some cases I still get them: aws-alb-ingress-controller is set to IP mode, the ALB is set to use deregistration_delay.timeout_seconds=300 (the same timeout), and the Deployment has a lifecycle preStop of 330 sec.

During application updates I can still see 5xx errors in some cases.

dictcp commented 4 years ago

To conclude, can the CA set node.Spec.Unschedulable and sleep right before deleting a node via the cloud provider? Alternatively perhaps we could use the ServiceNodeExclusion feature and label the dying node with "alpha.service-controller.kubernetes.io/exclude-balancer" and sleep before destroying the node?

Let's say the CA sets node.Spec.Unschedulable after drainSuccessful = true: https://github.com/kubernetes/autoscaler/blob/33bd5fc853f91bdfdc548052ccb453d5fea348c6/cluster-autoscaler/core/scale_down.go#L1140

For the "sleep", I think how long we need to wait depends on the implementation of the load balancer / service / ingress. How about configuring the sleep via an annotation delay-deletion.cluster-autoscaler.kubernetes.io/xxxx?

Then those who want a quick termination (e.g. a 300-node scale up and down) can leave it blank. Those who are using the in-tree LB-type Service can wait for 120 sec (which is longer than the nodeSyncPeriod) to make sure it is safe to terminate the node.

atulaggarwal commented 3 years ago

We were also facing a similar issue and came across a PR in autoscaler that cordons the node so the ALB can remove it from its healthy target list. We tried it with the following changes and were able to fully avoid 5xx errors caused by autoscaling. (This is tested in our staging env as of now; we will be pushing the change to our prod env next week.) Steps:

  1. Cordon the node along with the taint, done in the autoscaler code (PR - https://github.com/kubernetes/autoscaler/pull/3649). I have created a docker image on top of the 1.18 branch too (hosted at https://hub.docker.com/r/atulagrwl/autoscaler/ tag 1.18-c; cordon is enabled by default in this image).
  2. The default deregistration delay is 300 sec. Increase the delay to whatever is needed for in-flight requests in your use case. (https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay)
  3. Enable lifecycle termination hooks on the ASG for the worker nodes (https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html). Create a new lifecycle hook with Lifecycle transition set to Instance terminate, Default Result set to Continue, and the required wait time (in seconds) in Heartbeat timeout, so the node is kept around after termination starts.
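
For reference, step 3 can be set up with a single CLI call; the hook name, ASG name and timeout below are examples only:

    aws autoscaling put-lifecycle-hook \
      --lifecycle-hook-name wait-for-inflight-requests \
      --auto-scaling-group-name my-workers-asg \
      --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
      --default-result CONTINUE \
      --heartbeat-timeout 300   # seconds to keep the instance around after termination starts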

Using the above steps we were able to reduce the downscaling-related errors to 0. Please vote for the above PR so we can get it merged into autoscaler and avoid maintaining it separately. I will also spend some time moving item 3 into autoscaler, so it waits for a configured time before sending the termination to the node.

dictcp commented 3 years ago

There have been some recent updates to the service LB controller behaviour:

https://github.com/kubernetes/kubernetes/issues/65013

https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-624924857

The node.kubernetes.io/exclude-from-external-load-balancers label is beta and on by default in 1.19.

https://github.com/kubernetes/kubernetes/pull/90823

So after this PR, we may consider setting Unschedulable (aka cordon) earlier.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

raravena80 commented 3 years ago

/remove-lifecycle stale

atulaggarwal commented 3 years ago

Updates

  1. The ALB load balancer PR has been merged into the main branch (for AWS load balancer controller v2) and released in v2.1.1. It ensures that nodes tainted with the ToBeDeletedByClusterAutoscaler taint are removed from the healthy target list, so the current version of cluster-autoscaler (without the Unschedulable taint) also works fine.
  2. The Cluster Autoscaler PR has also been merged. This PR adds the Unschedulable taint in addition to the existing ToBeDeletedByClusterAutoscaler taint on nodes being removed from the cluster. As I understand it, the change is in the master branch and has not been released yet.

Both changes are not required - either one will do as an alternative to step 1 from my earlier comment. However, steps 2 and 3 are still required to ensure you keep the node around for a few minutes until responses for previous connections have been returned.

We have been running this solution in our production cluster for more than 3 months and have not observed any cluster-autoscaler-related issues yet (with our custom docker image; we will be switching to AWS v2.1.1 soon and moving away from the custom image). Hopefully this helps the teams facing a similar issue. It took us several days to identify and fix the issue (all thanks to this thread and the investigation done by everybody).

Since the issue has been handled by both cluster-autoscaler and the AWS load balancer controller, I suggest closing this issue if everyone else agrees.

Edit - updated the PR link for Cluster Autoscaler and added the suggestion to close this GitHub issue if everyone else agrees.

infa-ddeore commented 3 years ago

2. The Cluster Autoscaler PR has also been merged. This PR adds the Unschedulable taint in addition to the existing ToBeDeletedByClusterAutoscaler taint on nodes being removed from the cluster. As I understand it, the change is in the master branch and has not been released yet.

@atulaggarwal is this change available for k8 1.15 and 1.16 versions?

atulaggarwal commented 3 years ago

@infa-ddeore - The change should work on previous versions of k8s as well, as it relies on the Unschedulable taint. As per the documentation, this taint has existed since k8s 1.10: node.kubernetes.io/unschedulable (1.10 or later)

isugimpy commented 3 years ago

Looks like there hasn't been a cluster-autoscaler release since December, and this hasn't been ported to any of the earlier versions. Is it possible to get a release which includes it for all supported minor versions of cluster-autoscaler?

atulaggarwal commented 3 years ago

I have raised PRs to merge the change to release-1.20 and release-1.19 as well. We can port the fix to earlier versions too if required.

infa-ddeore commented 3 years ago

I have raised PRs to merge the change to release-1.20 and release-1.19 as well. We can port the fix to earlier versions too if required.

Making it available to previous versions would be very helpful.

atulaggarwal commented 3 years ago

@infa-ddeore - Which versions are you looking for? I can create PRs for them over the weekend once the above PR gets approved. I am not sure if anything else needs to be done apart from creating the PRs for previous versions of autoscaler.

infa-ddeore commented 3 years ago

@infa-ddeore - Which versions are you looking for? I can create PRs for them over the weekend once the above PR gets approved. I am not sure if anything else needs to be done apart from creating the PRs for previous versions of autoscaler.

Thanks a lot for the reply; having support for 1.16, 1.17 and 1.18 would be great.

mattyev87 commented 3 years ago

So what's the resolution here? We are using the CA and have an application behind the nginx ingress controller using externalTrafficPolicy: Cluster and an internal NLB. We have also activated cordon-node-before-terminating on the CA, set the deregistration delay to 300, and set up the lifecycle hook on our ASG to continue after a heartbeat timeout of 180 seconds, and we are still getting 5XXs when a node is terminated.

We can see that the node is draining for several minutes before deregistration and termination, and yet we still get errors. Is it possible that the NLB sends traffic to a draining node? What would be the point of that!?

Any help would be greatly appreciated.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1907#issuecomment-955742047):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
infa-ddeore commented 2 years ago

On k8s 1.19+, cordoning no longer removes the worker node from the ELB, so cordon-node-before-terminating isn't of any use; instead the label node.kubernetes.io/exclude-from-external-load-balancers (with any value) needs to be added to retain the same behaviour.

cluster-autoscaler 1.19+ needs an option similar to cordon-node-before-terminating that adds the node.kubernetes.io/exclude-from-external-load-balancers label to the worker node.
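
Until such an option exists upstream, the same effect can be had manually (or from a drain hook), for example:

    # Per the comments above, on 1.19+ this label is what excludes the node from
    # external load balancers; cordoning alone no longer does.
    kubectl label node ip-10-0-1-23.ec2.internal \
      node.kubernetes.io/exclude-from-external-load-balancers=true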

sveronese commented 2 years ago

@infa-ddeore If you are using a node group in an ASG with an ELB (Classic Load Balancer), try configuring this on your Service definition. We are in externalTrafficPolicy: Cluster mode:

    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: 'true'
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: '30'

It seems to work fine for us as a workaround. We don't see any errors in production during scale-out. Reference to AWS doc

But for sure I'm voting 👍 for adding node.kubernetes.io/exclude-from-external-load-balancers mechanism to cluster-autoscaler

infa-ddeore commented 2 years ago

@sveronese we have ELB and externalTrafficPolicy: Cluster and draining is enabled on ELB.

Scaling out isn't an issue; scaling in happens abruptly, i.e. the instance is terminated before it's deregistered from the ELB, so some connections fail until the health check starts failing.

Our developer ended up updating the cluster-autoscaler code: we added a --deregister-node-from-lb-before-terminating=true option that adds the node.kubernetes.io/exclude-from-external-load-balancers label to the terminating node along with the cordon-node-before-terminating step. With this change the node is removed from the ELB.

After this, our ASG lifecycle hook waits for some seconds before terminating the instance.

sveronese commented 2 years ago

@infa-ddeore Thank you for sharing, that's interesting. Can you share which EKS version you are running and which version of cluster-autoscaler? I'm really interested because we have the same configuration as you and for the moment we don't see any major side effects. We configured cluster-autoscaler with the following params:

        - "./cluster-autoscaler"
        - "--cloud-provider=aws"
        - "--expander=random"
        - "--max-bulk-soft-taint-count=10"
        - "--max-empty-bulk-delete=1"
        - "--max-node-provision-time=5m"
        - "--node-group-auto-discovery=asg:tag=kubernetes.io/custom-autoscaling,kubernetes.io/cluster-autoscaler/XXXX"
        - "--scale-down-delay-after-add=30m"
        - "--scale-down-delay-after-delete=3m"
        - "--scale-down-unneeded-time=10m"
        - "--scale-down-utilization-threshold=0.5"
        - "--skip-nodes-with-local-storage=false"
        - "--stderrthreshold=info"
        - "--unremovable-node-recheck-timeout=5m"
        - "--v=4"
        - "--leader-elect=false"

I'm really interested in understanding this because, as you mention, since cluster-autoscaler doesn't add the node.kubernetes.io/exclude-from-external-load-balancers label I would expect to see errors, and that doesn't seem to be the case at the moment. Thank you.

infa-ddeore commented 2 years ago

Our EKS clusters are on 1.18 to 1.20; for 1.19+ we added a custom change to remove the node from the LB before terminating. We use the same CA version as the k8s version, plus our custom changes.

Here is our CA config:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/xxxxx
        - --aws-use-static-instance-list=true
        - --balance-similar-node-groups=true
        - --balancing-ignore-label=beta.kubernetes.io/instance-type
        - --balancing-ignore-label=node.kubernetes.io/instance-type
        - --balancing-ignore-label=failure-domain.beta.kubernetes.io/zone
        - --balancing-ignore-label=topology.kubernetes.io/zone
        - --balancing-ignore-label=kubernetes.io/hostname
        - --balancing-ignore-label=vpc.amazonaws.com/eniConfig
        - --balancing-ignore-label=topology.ebs.csi.aws.com/zone
        - --balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateId
        - --balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateVersion
        - --cordon-node-before-terminating=true  <---------------------- this change is from current github issue for 1.18 and lower k8 versions
        - --deregister-node-from-lb-before-terminating=true <-------------------- this is our custom change for EKS 1.19+
        - --expander=least-waste
        - --logtostderr=true
        - --scan-interval=60s
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=info
        - --v=4
SCLogo commented 1 year ago

externalTrafficPolicy: Cluster

Hello @infa-ddeore, could you please share your code for deregister-node-from-lb-before-terminating=true? Thanks.