kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Node NotReady Disruption Controller #1659

Open diranged opened 5 days ago

diranged commented 5 days ago

Description

What problem are you trying to solve?

Sometimes nodes become NotReady for a variety of reasons (bad cloud provider instance, non-responsive kubelet, etc.). When a Node has been in a Ready state and then transitions to NotReady, I think Karpenter should have another Disruption Controller that monitors for these nodes and terminates them.

Third-party controllers like the Spot.io Ocean product and the Cluster Autoscaler both handle NotReady nodes for you automatically. Karpenter should be able to do the same thing.

(Note: we have also raised this with our AWS TAM via a support ticket, and we were advised to open a feature request here.)

Related: https://github.com/kubernetes-sigs/karpenter/issues/1573

How important is this feature to you?

This is actually a blocker for us migrating off of our current tools. We launch enough nodes, and see enough failures throughout the day, that we cannot fully migrate without a completely automated self-healing system that cycles these nodes out once they become NotReady.

(Separate but related is the ongoing discussion at https://github.com/bottlerocket-os/bottlerocket/issues/4075 about EKS nodes becoming unready due to heavy memory pressure.)

k8s-ci-robot commented 5 days ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.