kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Mega Issue: Node Disruption Lifecycle Taints #624

Open njtran opened 1 year ago

njtran commented 1 year ago

Description

What problem are you trying to solve? Karpenter has driven disruption of nodes through annotations and processes maintained in memory.

Karpenter should drive disruption through its own taint mechanism(s) while it discovers and executes disruption actions.

This issue proposes that each node owned by Karpenter will be in one of four states:

  1. Not Disrupting (No Taints) - Karpenter doesn't want to disrupt this node, and neither does the user.
  2. Candidate (PreferNoSchedule Taint) - Karpenter has identified the node as a possible option for any of its programmatic disruption mechanisms - expiration, drift, consolidation. A node that's chosen as a candidate can always be removed from candidacy.
  3. Disrupting (NoSchedule Taint) - Karpenter has validated and executed the disruption action for the node and has begun the standard flow. Karpenter can fail to disrupt a node; if it does, the node goes back to Not Disrupting, where it may be picked up as a Candidate again later.
  4. Terminating (NoExecute Taint) - Karpenter has deleted the node, triggering the finalization logic, where the last of the pods (e.g. DaemonSets) need to be evicted before the underlying instance is terminated and the node is removed. Once a node has begun terminating, there's no turning back: Karpenter will eventually terminate it.
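
For illustration, here is a minimal Go sketch of how the three tainted states above could be expressed as corev1.Taint values. The karpenter.sh/disruption key and the candidate/disrupting/terminating values are assumptions made for the sketch; the issue does not pin down exact names. "Not Disrupting" is simply the absence of any of these taints.

```go
package disruption

import corev1 "k8s.io/api/core/v1"

// Hypothetical taint key; the issue does not specify the exact name.
const TaintKeyDisruption = "karpenter.sh/disruption"

var (
	// Candidate: the scheduler should prefer other nodes, but may still
	// place pods here if nothing else fits.
	CandidateTaint = corev1.Taint{
		Key:    TaintKeyDisruption,
		Value:  "candidate",
		Effect: corev1.TaintEffectPreferNoSchedule,
	}

	// Disrupting: no new pods are scheduled while the disruption action
	// is being executed.
	DisruptingTaint = corev1.Taint{
		Key:    TaintKeyDisruption,
		Value:  "disrupting",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Terminating: remaining pods without a matching toleration are
	// evicted while the node is finalized.
	TerminatingTaint = corev1.Taint{
		Key:    TaintKeyDisruption,
		Value:  "terminating",
		Effect: corev1.TaintEffectNoExecute,
	}
)
```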

Related Issues:

Legion2 commented 12 months ago

I really like the idea of this issue. It will address the spread-out behavior of the default scheduler when new pods are continuously added even though there is plenty of idle capacity. With the described behavior, Karpenter would taint some of the nodes with PreferNoSchedule, causing the scheduler to bin-pack the new pods onto the remaining nodes instead of distributing them across all underutilized nodes. I hope there will be policies or configuration in place that allow Karpenter to identify nodes as disruption candidates even though they are still running some small jobs which cannot be evicted.
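
As a rough sketch of the tainting step described here (using client-go and building on the hypothetical TaintKeyDisruption constant from the sketch above), marking a candidate node could look something like this. PreferNoSchedule is only a soft preference, so pods can still land on the node if nothing else fits.

```go
package disruption

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markCandidate adds a PreferNoSchedule taint to a node so the default
// scheduler prefers packing new pods onto the remaining, untainted nodes.
func markCandidate(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == TaintKeyDisruption {
			return nil // already marked
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    TaintKeyDisruption,
		Value:  "candidate",
		Effect: corev1.TaintEffectPreferNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```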

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jmdeal commented 5 months ago

/remove-lifecycle stale

Nuru commented 5 months ago

Please be sure to handle the use case where a "do-not-evict" annotation is added to a Pod while it is already running on a Node. Of course there will be an unavoidable race condition, but it is important to realize that just because the Node is tainted, it does not mean that annotated Pods will not appear on it.
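
One way to narrow that race window is to re-check the node's pods for the annotation immediately before executing the disruption action. Below is a sketch of that check; the karpenter.sh/do-not-evict key and "true" value are assumptions based on this comment, not a confirmed contract.

```go
package disruption

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed annotation key; the exact key may differ.
const doNotEvictAnnotation = "karpenter.sh/do-not-evict"

// hasDoNotEvictPods re-lists the pods on a node right before disruption is
// executed, so a pod annotated after the node became a candidate still
// blocks the action. This narrows, but cannot fully close, the race window.
func hasDoNotEvictPods(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		if p.Annotations[doNotEvictAnnotation] == "true" {
			return true, nil
		}
	}
	return false, nil
}
```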

It would be good for my use case if there were a way for a Pod to get notified that Karpenter is considering consolidating the node (NoSchedule Taint added) so it can immediately decide to either quit or annotate itself. That would give the Pod a head start in the race and avoid most, if not all, real-world mishaps.

One way to do this would be via another annotation, such as ok-to-disrupt or prefer-to-disrupt or something similar, that tells Karpenter to send the Pod some signal other than SIGTERM that the Pod can respond to (and would ignore by default) when Karpenter considers the Node a likely consolidation target. This would have to happen after the Node is tainted, so that when the Pod quits and is immediately replaced with a new Pod by the Deployment, the new Pod does not get scheduled onto the same Node. We would also want a configurable delay between the taint and notification in step 3 and the actual termination in step 4, so we can be sure to give the Pod enough time to respond and block termination.
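
A rough pod-side sketch of that idea follows, assuming SIGUSR1 stands in for the "signal other than SIGTERM" and an early exit as the response; how Karpenter would actually deliver the signal and the exact annotation names are left open here.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// SIGUSR1 is an assumption; the delivery mechanism is not specified.
	notify := make(chan os.Signal, 1)
	signal.Notify(notify, syscall.SIGUSR1)

	go func() {
		<-notify
		// The node is already tainted NoSchedule, so the replacement pod
		// created by the Deployment should land on a different node.
		// Alternatively, the pod could patch a do-not-evict annotation
		// onto itself here instead of exiting.
		log.Println("node flagged for disruption; shutting down early")
		os.Exit(0)
	}()

	// ... the pod's normal work would run here ...
	select {}
}
```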

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Nuru commented 1 month ago

/remove-lifecycle rotten