kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster Autoscaler Forgets Nodes Scheduled for Deletion during Restart #5048

Open jabdoa2 opened 2 years ago

jabdoa2 commented 2 years ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.15", GitCommit:"8f1e5bf0b9729a899b8df86249b56e2c74aebc55", GitTreeState:"clean", BuildDate:"2022-01-19T17:23:01Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS using Kops

What did you expect to happen?:

When cluster-autoscaler selects a node for deletion, it cordons the node and then deletes it after 10 minutes, regardless of circumstances.

What happened instead?:

When cluster-autoscaler is restarted (typically due to scheduling) it "forgets" about the cordoned node. We then end up with nodes which are unused and no longer considered by cluster-autoscaler. We have seen this happen multiple times in different clusters. It always (and only) happens when cluster-autoscaler restarts after tainting/cordoning a node.

How to reproduce it (as minimally and precisely as possible):

  1. Wait for cluster-autoscaler to select and mark a node for deletion
  2. After cluster-autoscaler has cordoned the node, delete the cluster-autoscaler pod (see the sketch below)
  3. Cluster-autoscaler will be recreated (and usually the other cluster-autoscaler pod will take over)
  4. The cordoned node stays there forever
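
A rough way to drive steps 2 and 3, assuming the autoscaler runs in kube-system and its pods carry an app=cluster-autoscaler label (both are assumptions, adjust to your manifests):

      # watch for the node cluster-autoscaler has cordoned for removal
      kubectl get nodes --field-selector spec.unschedulable=true
      # delete the running cluster-autoscaler pod within the 10-minute grace period
      kubectl -n kube-system delete pod -l app=cluster-autoscaler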

Anything else we need to know?:

Config:

      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/xxxx
      --balance-similar-node-groups=false
      --cordon-node-before-terminating=true
      --ignore-daemonsets-utilization=true
      --ignore-mirror-pods-utilization=true
      --logtostderr=true
      --scale-down-utilization-threshold=0.99
      --skip-nodes-with-local-storage=false
      --stderrthreshold=info
      --v=4

Log on "old" pod instance:

scale_down.go:791] xxxx was unneeded for 9m52.964516731s
static_autoscaler.go:503] Scale down status: unneededOnly=false lastScaleUpTime=2022-07-25 13:02:50.704606468 +0000 UTC m=+17851.084374773 lastScaleDownDeleteTime=2022-07-25 13:17:25.415659793 +0000 UTC m=+18725.795428101 lastScaleDownFailTime=2022-07-25 13:02:50.704606636 +0000 UTC m=+17851.084374939 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
static_autoscaler.go:516] Starting scale down

Logs after the node has been "forgotten" in the new pod instance:

scale_down.go:407] Skipping xxxx from delete consideration - the node is currently being deleted
[...] # two hours later
scale_down.go:427] Node xxxx - memory utilization 0.000000S
static_autoscaler.go:492] xxxx is unneeded since 2022-07-25 13:28:13.284790432 +0000 UTC m=+1511.821407617 duration 2h8m33.755684792s

Autoscaler clearly still "sees" the node but it does not act on it anymore.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 2 years ago

Issue still exists.

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 1 year ago

Still exists and happening multiple times per week for us.

/remove-lifecycle stale

rcjsuen commented 1 year ago

Still exists and happening multiple times per week for us.

@jabdoa2 I am just a random person on GitHub, but I was wondering what version of the Cluster Autoscaler you are using now. You mentioned 1.20.2 when you opened the bug last year. Have you updated since then? We use 1.22.3 ourselves, so I am just wondering if this is something we should keep an eye on as well.

Thank you for your information.

jabdoa2 commented 1 year ago

We have since updated to 1.22. The issue still persists.

You can work around it by running the autoscaler on nodes which are not scaled by the autoscaler (i.e. a master or a dedicated node group). However, this issue still occurs when those nodes are upgraded or experience disruptions for other reasons. The issue is still 100% reproducible on all our clusters if you delete the autoscaler within the 10-minute grace period before a node is deleted. We strongly recommend monitoring for nodes which have been cordoned for more than a few minutes (see the example below); those prevent scale-ups in that node group later on and will cost you money without any benefit.
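
A simple check that can feed such monitoring (a sketch; ToBeDeletedByClusterAutoscaler is the taint key cluster-autoscaler normally uses when marking a node for removal, adjust if yours differs):

      # nodes that are currently cordoned (unschedulable)
      kubectl get nodes --field-selector spec.unschedulable=true
      # nodes still carrying the cluster-autoscaler deletion taint
      kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' | grep ToBeDeletedByClusterAutoscaler

Alerting when either of these stays non-empty for more than a few minutes catches the stuck nodes described above.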

You might also want to monitor for nodes which are not part of the cluster, which have been an issue for us earlier. However, we have not seen this recently, as the autoscaler seems to remove those nodes after a few hours (if they still have the correct tags).

vadasambar commented 1 year ago

This issue can be seen in 1.21.x as well. cluster-autoscaler sees the cordoned node and logs a message saying it is unneeded (as described in the issue description). It also considers the cordoned node as a possible destination for unschedulable pods when it runs scale-up simulations. If the unschedulable pod fits on the cordoned node, cluster-autoscaler gives up on bringing up a new node. The pod then gets stuck in the Pending state forever, because it can't actually be scheduled on the cordoned node and cluster-autoscaler won't bring up a new node either.

vadasambar commented 1 year ago

As a short-term solution, removing the cordoned node manually fixes the issue.
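
One way to do that by hand (a sketch; ToBeDeletedByClusterAutoscaler and DeletionCandidateOfClusterAutoscaler are the taint keys cluster-autoscaler normally applies, and <node-name> is a placeholder):

      # hand the node back to the scheduler and the autoscaler
      kubectl uncordon <node-name>
      kubectl taint nodes <node-name> ToBeDeletedByClusterAutoscaler- || true
      kubectl taint nodes <node-name> DeletionCandidateOfClusterAutoscaler- || true
      # alternatively, drain the node and terminate the backing instance yourself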

vadasambar commented 1 year ago

I wasn't able to reproduce the problem with the steps in the description. I wonder if it happens only sometimes or maybe I am doing something wrong.

jabdoa2 commented 1 year ago

For us this happens 100% reliably in multiple clusters. At which step did it behave differently for you?

vadasambar commented 1 year ago

@jabdoa2 I used slightly different flags in an attempt to perform the test quickly:

--scale-down-unneeded-time=1m
--unremovable-node-recheck-timeout=1m

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

vadasambar commented 1 year ago

Maybe the issue only shows up with the default values, since that is what you are using. When I saw the issue on my end, default values were in use as well.

vadasambar commented 1 year ago

~I was able to reproduce the issue without having to revert to the default flags. The trick is to kill the CA pod just before the scale-down-unneeded-time timeout expires (i.e. in the last iteration of the scale-down loop, which runs every 10 seconds by default). Timing is important here, which makes it harder to reproduce manually.~

The new CA pod was able to delete the node after some time. :(

vadasambar commented 1 year ago

I noticed this issue happens when the cluster-autoscaler pod tries to scale down the node it is running on: it drains itself from the node and leaves the node in a cordoned and tainted state.

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node. This disables scale-down and puts it into cooldown, effectively skipping the code which does the actual scale-down until the cooldown is lifted (which will never happen, because the unschedulable pod will never get scheduled on a cordoned and tainted node it can't tolerate).

vadasambar commented 1 year ago

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

vadasambar commented 1 year ago

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g., maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the draining of the node it is running on.
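
For example, with kubectl (a sketch; the cluster-autoscaler-pdb name and the app.kubernetes.io/name=cluster-autoscaler selector are assumptions, match them to your deployment's pod labels):

      kubectl -n kube-system create poddisruptionbudget cluster-autoscaler-pdb \
        --selector=app.kubernetes.io/name=cluster-autoscaler \
        --max-unavailable=0

With maxUnavailable set to 0, the eviction API refuses to evict the matching pod, so a drain initiated by cluster-autoscaler should stop short of evicting its own pod.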

jabdoa2 commented 1 year ago

It seems like cluster-autoscaler doesn't consider the tainted and cordoned state of the node when running simulations.

Yeah, it reports the node but simply never acts on it. It looks weird, and it can cause a lot of havoc in your cluster when important workloads can no longer be scheduled.

jabdoa2 commented 1 year ago

One quick fix for this can be to make sure the cluster-autoscaler pod is never drained from the node on which it is running. This can be done by adding a strict PDB (e.g., maxUnavailable: 0) or by making sure the cluster-autoscaler pod satisfies the criteria for blocking the draining of the node it is running on.

It helps most of the time. You can also run the autoscaler on the master nodes or set safe-to-evict: false. But even with that, we have seen this bug when we were rolling the cluster using kops or during other disruptions (such as spot instance interruptions or node removal for maintenance).
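
For reference, safe-to-evict refers to the cluster-autoscaler.kubernetes.io/safe-to-evict pod annotation; one way to set it (the deployment name and namespace here are assumptions, adjust to your setup):

      kubectl -n kube-system patch deployment cluster-autoscaler -p \
        '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'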

vadasambar commented 1 year ago

@jabdoa2 a dedicated nodegroup with taints, so that nothing else gets scheduled on it except cluster-autoscaler, should solve the issue (for all cases, I think) until we have a better solution.
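
A sketch of what that could look like (the label selector and the dedicated=cluster-autoscaler taint are made-up examples; the cluster-autoscaler Deployment would need a matching toleration and nodeSelector):

      # taint the dedicated nodegroup so ordinary workloads stay off it
      kubectl taint nodes -l kops.k8s.io/instancegroup=cluster-autoscaler-only \
        dedicated=cluster-autoscaler:NoSchedule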

jabdoa2 commented 1 year ago

@jabdoa2 a dedicated nodegroup with taints, so that nothing else gets scheduled on it except cluster-autoscaler, should solve the issue (for all cases, I think) until we have a better solution.

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-). So it won't happen in the happy case, but when things go south this tends to persist the breakage and prevent clusters from recovering (e.g. because you can no longer schedule into a certain AZ).

zaafar commented 1 year ago

The real problem starts when a new cluster-autoscaler pod comes up: it sees an unschedulable pod and thinks it can schedule that pod on the cordoned and tainted node.

Sounds like this is the root cause and should be fixed.

vadasambar commented 1 year ago

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

  1. We will have multiple cluster-autoscaler replicas with PDB
  2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)
  3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, the other replica can take over and scale down the node properly. Note that the node scale-down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we ran only cluster-autoscaler on a dedicated nodegroup, because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)?

Sounds like this is the root cause and should be fixed.

Agreed.

jabdoa2 commented 1 year ago

Unless you roll that group last during a cluster upgrade or you experience disruptions of any kind ;-).

Sorry, I am not sure I understand this fully.

My understanding is,

1. We will have multiple cluster-autoscaler replicas with PDB

2. These replicas would be scheduled on a dedicated nodegroup where only cluster-autoscaler pods are scheduled (this can be achieved with taints)

3. In case of any disruptions, say the node where cluster-autoscaler was running goes down OR cluster-autoscaler itself scales down the node, the other replica can take over and scale down the node properly. Note that the node scale-down issue happens only when there is a pending pod and cluster-autoscaler thinks it can schedule the pending pod on the cordoned node. This wouldn't happen if we ran only cluster-autoscaler on a dedicated nodegroup, because cluster-autoscaler won't think the pending pods can be scheduled on the cordoned node since it has taints.

Do you see any problem with this approach (just trying to understand what I am missing)?

The issue can still happen in other node groups. If a scale-down is ongoing and a disruption hits the current autoscaler, there is a chance that this will happen. You can make those disruptions less likely with either a dedicated node group or by running the autoscaler on the master nodes, but that only reduces the chance. Rolling node groups, upgrading the autoscaler, or node disruptions still trigger this. We have a few clusters which use spot instances and scale a lot, so it keeps happening.

vadasambar commented 1 year ago

I see the problem with the solution I proposed. Thanks for explaining.

vadasambar commented 1 year ago

Brought this up in the SIG meeting today. Based on discussion with @MaciekPytel, there seem to be 2 ways of going about fixing this:

  1. Make cluster-autoscaler remove all taints when it restarts
  2. Fix the code around scale-up simulation so that it considers taints/cordoned state of the node

vadasambar commented 1 year ago

Related issue: https://github.com/kubernetes/autoscaler/issues/4456

It looks like the problem might be fixed in the 1.26 version of cluster-autoscaler: https://github.com/kubernetes/autoscaler/pull/5054

vadasambar commented 1 year ago

We would need another PR on top of https://github.com/kubernetes/autoscaler/pull/5054 as explained in https://github.com/kubernetes/autoscaler/pull/5054#issuecomment-1381181229 to actually fix the issue.

vadasambar commented 1 year ago

We have logic for removing all taints on the nodes and uncordoning them every time cluster-autoscaler restarts, but it is not called when --cordon-node-before-terminating=true is used, because the logic that lists the nodes which need to be untainted doesn't consider cordoned nodes (ref2, ref3). All of the links are for the 1.21 commit of cluster-autoscaler. Not sure if the issue still persists in the master branch.

If the flag is removed, taints should be removed from all nodes every time the cluster-autoscaler pod restarts.
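
Concretely, that would mean running with the flags from the issue description minus this one (at the cost of nodes no longer being cordoned before termination):

      --cordon-node-before-terminating=true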

fookenc commented 1 year ago

Hi @vadasambar, I'm not sure if it solves the issue mentioned, but there was a separate PR #5200. This was merged last year in September. It changed the behavior so that taints should be removed from all nodes instead of only those that were Ready. I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set, so it should target all nodes, if I'm understanding correctly. Please correct me if I've misunderstood.

vadasambar commented 1 year ago

@fookenc thanks for replying. Looks like we've already fixed the issue in 1.26 :) I was looking at https://github.com/kubernetes/autoscaler/pull/4211 which had similar code and thought we decided not to merge it.

I've been reviewing the code to check, and the NewAllNodeLister doesn't appear to have a filter set, so it should target all nodes, if I'm understanding correctly. Please correct me if I've misunderstood.

You are right. Your PR should fix the issue mentioned in https://github.com/kubernetes/autoscaler/issues/5048#issuecomment-1468368775 i.e., the problem described in the description of this issue.

There is an overarching issue around scale-up preventing scale-down, because CA thinks it can schedule pods on an existing node (when it can't, because the node has taints or is cordoned), for which we already have your PR https://github.com/kubernetes/autoscaler/pull/5054 merged. My understanding is that implementing those interfaces for a specific cloud provider should fix the issue in that cloud provider.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 9 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 4 months ago

/remove-lifecycle stale

Bug still exists

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jabdoa2 commented 1 month ago

/remove-lifecycle stale