kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

CA: Ignore taints for node update #6023

Closed kwohlfahrt closed 1 month ago

kwohlfahrt commented 11 months ago

Which component are you using?: Cluster autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

I would like to upgrade the base OS image in my cluster. The users of this cluster have some interactive workloads (e.g. Jupyter notebooks) running in the cluster that I can't easily kill - I have to wait for the users to terminate them in their own time, which may take a few days. My previous attempt was as follows:

  1. Upload a new AMI
  2. Update the ASG backing the nodes to use the new AMI for newly created nodes (but don't start an instance refresh)
  3. Taint all existing nodes with NoSchedule, to ensure no new pods are scheduled on nodes with an old AMI
  4. Drain all non-interactive workloads, and wait for users to terminate their interactive pods as convenient

During step (4), any newly created replacement pods are unschedulable on existing nodes, due to the taint. I expected the CA to then scale up the ASG to make room for the new pods and scale in the old, tainted nodes once they were empty.
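For reference, step (3) can be scripted. Below is a minimal client-go sketch of what that tainting step might look like - the `old-ami=true` label selector is purely hypothetical, and the taint key is simply the one that shows up in the logs later in this thread:

```go
// Hypothetical sketch of step (3): add a NoSchedule taint to every node still
// running the old AMI. The "old-ami=true" label selector is an assumption; in
// practice you would select the nodes however you track AMI versions.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "old-ami=true", // hypothetical label marking old-AMI nodes
	})
	if err != nil {
		panic(err)
	}

	for i := range nodes.Items {
		node := nodes.Items[i].DeepCopy()
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    "charmtx.com/maintenance", // the taint key used later in this thread
			Value:  "true",
			Effect: corev1.TaintEffectNoSchedule,
		})
		if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
	}
}
```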

Unfortunately, this didn't happen. I don't have the exact logs anymore, but the error seemed to be that the CA detected that the existing nodes of the ASG were tainted, assumed that any newly created nodes from the same ASG would also be tainted, and therefore did not create any new nodes. Existing old nodes were scaled down correctly once they were empty.

Describe the solution you'd like.:

I would like to be able to configure the autoscaler to not assume that taints on running nodes will apply to newly created nodes in the ASG. The autoscaler should consider only the taints in the ASG tags (e.g. k8s.io/cluster-autoscaler/node-template/taint/<taint>), and assume any freshly created nodes will match that spec.

I think this would solve the problem, as CA would then be able to create new nodes for the pending workloads that can't run on the existing tainted nodes.
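To make the intent concrete, here is a rough sketch of "consider only the taints in the ASG tags" - this is not CA code, the function name is made up, and the `<value>:<effect>` tag-value format is my assumption based on the AWS cloudprovider docs:

```go
// Rough sketch (not CA code): derive template taints only from the ASG's
// node-template tags, so taints added to running nodes never leak into the
// template used for scale-up decisions. Assumes a "<value>:<effect>" tag value.
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

const taintTagPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"

func taintsFromASGTags(tags map[string]string) []corev1.Taint {
	var taints []corev1.Taint
	for key, value := range tags {
		if !strings.HasPrefix(key, taintTagPrefix) {
			continue
		}
		parts := strings.SplitN(value, ":", 2)
		if len(parts) != 2 {
			continue // skip malformed tag values
		}
		taints = append(taints, corev1.Taint{
			Key:    strings.TrimPrefix(key, taintTagPrefix),
			Value:  parts[0],
			Effect: corev1.TaintEffect(parts[1]),
		})
	}
	return taints
}

func main() {
	tags := map[string]string{
		// hypothetical ASG tag declaring a taint for newly created nodes
		"k8s.io/cluster-autoscaler/node-template/taint/dedicated": "gpu:NoSchedule",
	}
	// The NoSchedule taint added by hand to old nodes is not part of the tags,
	// so it never appears in the template.
	fmt.Println(taintsFromASGTags(tags))
}
```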

Describe any alternative solutions you've considered.:

Additional context.:

Discussion in Slack with @vadasambar: https://kubernetes.slack.com/archives/C09R1LV8S/p1691051051375539

vadasambar commented 11 months ago

> I would like to be able to configure the autoscaler to not assume that taints on running nodes will apply to newly created nodes in the ASG.

This is possible with the --ignore-taint flag:

> Specifies a taint to ignore in node templates when considering to scale a node group

https://github.com/kubernetes/autoscaler/blob/a3bcd98e23259d5db29b34f9d92359835977d55f/cluster-autoscaler/main.go#L195

CA maintains a template of what an upcoming node will look like (node template) for every ASG. As long as we tell CA to ignore a certain taint in this template node, it won't consider the taint when scaling up.
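To illustrate the idea (this is not the CA's actual implementation, just a minimal sketch of "ignore a certain taint in this template node"):

```go
// Minimal sketch of the idea behind --ignore-taint: taints whose keys are in
// the ignore set are dropped from the template node before scale-up simulation.
// Not the cluster autoscaler's real code; taint keys below are examples only.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func scrubIgnoredTaints(templateTaints []corev1.Taint, ignored map[string]bool) []corev1.Taint {
	var kept []corev1.Taint
	for _, t := range templateTaints {
		if ignored[t.Key] {
			continue // listed via --ignore-taint: leave it out of the template
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	templateTaints := []corev1.Taint{
		{Key: "example.com/maintenance", Value: "true", Effect: corev1.TaintEffectNoSchedule}, // e.g. inherited from a running node
		{Key: "dedicated", Value: "gpu", Effect: corev1.TaintEffectNoSchedule},                // e.g. declared via ASG tags
	}
	ignored := map[string]bool{"example.com/maintenance": true}
	fmt.Println(scrubIgnoredTaints(templateTaints, ignored)) // only the "dedicated" taint remains
}
```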

vadasambar commented 11 months ago

The overarching problem of how to safely roll out a new AMI without causing disruptions for customers might be better suited to cluster-api, but I wonder if we can also do something here, e.g. not let CA scale down a node unless a certain condition is met.

kwohlfahrt commented 11 months ago

> > I would like to be able to configure the autoscaler to not assume that taints on running nodes will apply to newly created nodes in the ASG.
>
> This is possible with the --ignore-taint flag:
>
> > Specifies a taint to ignore in node templates when considering to scale a node group
>
> https://github.com/kubernetes/autoscaler/blob/a3bcd98e23259d5db29b34f9d92359835977d55f/cluster-autoscaler/main.go#L195
>
> CA maintains a template of what an upcoming node will look like (node template) for every ASG. As long as we tell CA to ignore a certain taint in this template node, it won't consider the taint when scaling up.

Maybe I'm just confused about the documentation then - I thought --ignore-taint was to ignore taints in the ASG tags like k8s.io/cluster-autoscaler/node-template/taint/<taint>, but you're saying it will also stop CA from adding taints on running nodes to the new node template?

Anyway, I'll test this flag (probably next week though), and see if it helps for an update rollout.

> I wonder if we can also do something here, e.g. not let CA scale down a node unless a certain condition is met.

I think the scale-down logic is working OK, it's the lack of scale-up that was causing issues for me.

vadasambar commented 11 months ago

> I thought --ignore-taint was to ignore taints in the ASG tags like k8s.io/cluster-autoscaler/node-template/taint/<taint>, but you're saying it will also stop CA from adding taints on running nodes to the new node template?

ASG tags are used to create a node template from scratch (ref1, ref2). This happens when no node template is present i.e., when CA has just started. Once CA scales up, a node for the ASG will be running in the cluster. This node will then be used as the node template (since it will have the taints from the ASG + anything else).

--ignore-taint scrubs the taints from the node template (whether it was created from scratch OR based on an existing node in the cluster).

https://github.com/kubernetes/autoscaler/blob/e1b03fac9958791790bfc18eeba9fab5cac0ccc1/cluster-autoscaler/core/utils/utils.go#L41-L48

Think of nodeGroup as the ASG. We get the node template on line 41 and call SanitizeNode on line 48, which internally calls SanitizeTaints.

https://github.com/kubernetes/autoscaler/blob/e1b03fac9958791790bfc18eeba9fab5cac0ccc1/cluster-autoscaler/core/utils/utils.go#L119

The taintConfig you see above is passed to SanitizeTaints so that the node template is scrubbed of the taints specified in the --ignore-taint flag.
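Roughly, the flow described above looks like this - names and signatures here are purely illustrative, not the cluster autoscaler's real API; the real logic lives in the linked utils.go:

```go
// Conceptual outline of the template-building flow described above; names and
// signatures are illustrative only, not the cluster autoscaler's real API.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildTemplateNode: base the template on a running node if one exists
// (so it inherits that node's taints), otherwise build it from scratch
// (e.g. from ASG tags); then scrub any taints listed via --ignore-taint.
func buildTemplateNode(runningNode *corev1.Node, fromASGTags func() *corev1.Node, ignored map[string]bool) *corev1.Node {
	var template *corev1.Node
	if runningNode != nil {
		template = runningNode.DeepCopy() // inherits manually-added taints too
	} else {
		template = fromASGTags()
	}

	var kept []corev1.Taint
	for _, t := range template.Spec.Taints {
		if !ignored[t.Key] {
			kept = append(kept, t)
		}
	}
	template.Spec.Taints = kept
	return template
}

func main() {
	old := &corev1.Node{}
	old.Spec.Taints = []corev1.Taint{{Key: "example.com/maintenance", Effect: corev1.TaintEffectNoSchedule}}

	tmpl := buildTemplateNode(old, nil, map[string]bool{"example.com/maintenance": true})
	fmt.Println(len(tmpl.Spec.Taints)) // 0: the ignored taint was scrubbed from the template
}
```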

P.S.: We don't have a FAQ around --ignore-taint. We need better documentation around this.

kwohlfahrt commented 6 months ago

OK, it's been a long time, but I've now tested this, and the --ignore-taint flag does not do what I want. If I set it, I see logs like:

I1220 16:08:54.934581       1 taints.go:384] Overriding status of node i-0efa87576c37e0811, which seems to have ignored taint "charmtx.com/maintenance"
I1220 16:08:55.087296       1 klogx.go:87] Pod research/gpu-test-d694757df-7lg97 can be moved to template-node-for-k8s-kai-cluster-t3a.medium-eu-central-1b-e5fee97-4454553931225414682-upcoming-1

The cluster autoscaler seems to be completely ignoring the taint, and assuming that my pod can be scheduled to nodes that have this taint.

This is not what I want. I only want the autoscaler to stop assuming that new nodes will have the same taints as existing nodes; it should take the taints for new nodes only from the ASG tags.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/6023#issuecomment-2120542934):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.