Open jonathan-innis opened 7 months ago
Also, as part of the alignment effort with Cluster Autoscaler, I imagine whatever change we suggest making upstream would also apply to Cluster Autoscaler. Perhaps we could align on the taint that we both want to use and propose special handling for that taint in the node printer columns, as a change that we could try to get into upstream?
cc: @MaciekPytel @towca
Also also, there was a discussion in the K8s Slack over whether Karpenter should be using the `unschedulable` field on the node to piggy-back on this logic or not. Practically, we have chosen to separate ourselves from the basic "cordon" handling for the following reasons:
`node.kubernetes.io/unschedulable` taint by default)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
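This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, `lifecycle/stale` is applied
- After further inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After further inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale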
I would like to note that other controllers rely on the SchedulingDisabled-Taint to see whether a node is shutting down.
The CloudNativePG operator, for example, has a PDB that disallows deleting the pod. If a node with a CNPG pod receives the SchedulingDisabled-Taint, the operator will start to migrate that pod itself. Since Karpenter does not use the taint, the node is stuck.
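The kind of check described in this comment can be sketched roughly as follows. This is only an illustration, not the CloudNativePG operator's actual code: `Node` and `Taint` are minimal local stand-ins for the Kubernetes API types, `isCordoned` is a hypothetical helper name, and `example.com/disruption` is a made-up taint key standing in for whatever custom taint an autoscaler might apply.

```go
package main

import "fmt"

// Taint and Node are minimal local stand-ins for the Kubernetes API types,
// defined here only so that the sketch compiles on its own.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name          string
	Unschedulable bool // mirrors spec.unschedulable
	Taints        []Taint
}

// Taint key that Kubernetes associates with a cordoned node.
const unschedulableTaintKey = "node.kubernetes.io/unschedulable"

// isCordoned (hypothetical helper) reports whether a node looks cordoned,
// based on spec.unschedulable or the unschedulable taint.
func isCordoned(n Node) bool {
	if n.Unschedulable {
		return true
	}
	for _, t := range n.Taints {
		if t.Key == unschedulableTaintKey {
			return true
		}
	}
	return false
}

func main() {
	cordoned := Node{
		Name:          "node-a",
		Unschedulable: true,
		Taints:        []Taint{{Key: unschedulableTaintKey, Effect: "NoSchedule"}},
	}
	// "example.com/disruption" is a hypothetical custom taint key.
	custom := Node{
		Name:   "node-b",
		Taints: []Taint{{Key: "example.com/disruption", Effect: "NoSchedule"}},
	}
	fmt.Println(cordoned.Name, isCordoned(cordoned))
	fmt.Println(custom.Name, isCordoned(custom))
}
```

A controller that only performs this check never notices the node carrying the custom taint, which matches the stuck-node behavior described above.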
/remove-lifecycle stale
Description
What problem are you trying to solve?
Currently, Kubernetes uses the `node.kubernetes.io/unschedulable` taint and the `spec.unschedulable` field on the node to mark that a node is cordoned and may be about to be drained for maintenance or removal. This is visible through the printer columns that you get when you call `kubectl get nodes`, where a cordoned node shows `SchedulingDisabled` in its status.
The code for this handling lives in the printer columns logic for `kubectl`.
This is nice visibility for users when Kubernetes is using this specific field; however, nothing is surfaced when Karpenter adds its taint and is actively draining the node, since Karpenter doesn't update the `spec.unschedulable` field that the printer relies on to add the `SchedulingDisabled` section to the node.
It would be a really nice UX if we could add something similar to `SchedulingDisabled` (perhaps something like `Disrupting` or `Terminating`) to the node so that users get visibility through the printer that Karpenter is acting on the node.
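To make the request concrete, here is a simplified sketch of how such a status column could be assembled. It is not the actual `kubectl` printer code: the `Node` and `Taint` types are local stand-ins, `nodeStatus` is a hypothetical function, and both the taint-to-label mapping and the `example.com/disruption` key are assumptions used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Taint and Node are minimal local stand-ins for the Kubernetes API types.
type Taint struct {
	Key string
}

type Node struct {
	ReadyCondition string // status of the Ready condition: "True", "False", "Unknown", or "" if absent
	Unschedulable  bool   // mirrors spec.unschedulable
	Taints         []Taint
}

// nodeStatus (hypothetical) builds a comma-separated status string.
// taintLabels maps a taint key to the extra label to display; this mapping
// is the proposed extension, not existing upstream behavior.
func nodeStatus(n Node, taintLabels map[string]string) string {
	var status []string
	switch n.ReadyCondition {
	case "True":
		status = append(status, "Ready")
	case "":
		status = append(status, "Unknown")
	default:
		status = append(status, "NotReady")
	}
	// Existing behavior: a cordoned node is flagged from spec.unschedulable.
	if n.Unschedulable {
		status = append(status, "SchedulingDisabled")
	}
	// Proposed behavior: surface well-known taints as additional labels.
	for _, t := range n.Taints {
		if label, ok := taintLabels[t.Key]; ok {
			status = append(status, label)
		}
	}
	return strings.Join(status, ",")
}

func main() {
	// Hypothetical mapping from a custom taint key to a display label.
	labels := map[string]string{"example.com/disruption": "Disrupting"}

	cordoned := Node{ReadyCondition: "True", Unschedulable: true}
	disrupting := Node{ReadyCondition: "True", Taints: []Taint{{Key: "example.com/disruption"}}}

	fmt.Println(nodeStatus(cordoned, labels))
	fmt.Println(nodeStatus(disrupting, labels))
}
```

Keying the extra label off a taint rather than `spec.unschedulable` would let Karpenter and Cluster Autoscaler share the same display logic if they align on a taint, without either of them taking over the basic cordon handling.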