kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Nodes in Unknown State and Clarification on High CPU Usage and CPU Limit Settings #1585

Open mahaveer08 opened 3 weeks ago

mahaveer08 commented 3 weeks ago

Description

What problem are you trying to solve?

I'm currently managing a large-scale deployment using Karpenter, and I am encountering two specific issues:

Nodes in Unknown State: Some of my nodes have intermittently transitioned to an unknown state. This is causing disruptions in workload scheduling and cluster stability. I need to understand potential reasons for this behavior and how to troubleshoot and resolve it.

Clarification on High CPU Usage and CPU Limit Settings: I have several high-CPU node pools, and I'm trying to understand how to optimize CPU utilization and properly set CPU limits. Specifically, I would like guidance on the following:

- How to prevent CPU overutilization when scaling large node pools (e.g., 14,000+ CPUs); see the configuration sketch after this list.
- Best practices for setting CPU limits in node pools to avoid resource exhaustion without underutilizing available capacity.

How important is this feature to you?

These issues are critical to the stability and efficiency of our cluster. The unknown node state issue directly impacts the reliability of our services, while high CPU usage and misconfigured CPU limits can lead to performance bottlenecks. Any guidance or fixes would significantly enhance the operation of our workloads at scale.
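On the CPU-limit question: Karpenter's per-pool provisioning cap is the `spec.limits` block on a `NodePool`; once the aggregate capacity of the pool's nodes reaches the limit, Karpenter stops launching new nodes for that pool. A minimal sketch, assuming the `karpenter.sh/v1` API (older releases use `v1beta1`) and the AWS provider; the pool name, node class reference, and values are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-cpu                       # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws       # assumes the AWS provider
        kind: EC2NodeClass
        name: default                  # placeholder node class name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  # Provisioning ceiling: Karpenter stops creating nodes for this pool once
  # the summed capacity of its existing nodes reaches these values.
  limits:
    cpu: "14000"                       # matches the ~14,000-CPU scale above
    memory: 50000Gi                    # placeholder; size to your headroom
```

Note that `limits` is a scale-out ceiling, not a utilization control: it prevents runaway provisioning, but per-pod CPU contention on each node is still governed by pod `resources.requests` and `resources.limits`.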

k8s-ci-robot commented 3 weeks ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
mahaveer08 commented 3 weeks ago

These are the current node conditions:

| Condition | Status | Message |
| --- | --- | --- |
| MemoryPressure | Unknown | Kubelet stopped posting node status. |
| DiskPressure | Unknown | Kubelet stopped posting node status. |
| PIDPressure | Unknown | Kubelet stopped posting node status. |
| Ready | Unknown | Kubelet stopped posting node status. |

mahaveer08 commented 3 weeks ago

What configuration should I add to resolve or remove nodes stuck in the "NotReady" state, and what are the common reasons for nodes entering it?

mahaveer08 commented 3 weeks ago

Is there any configuration we need to add to handle nodes in the NotReady state so they can be automatically restarted or drained? Currently, these nodes get stuck and remain in that state indefinitely.
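Karpenter does not restart unhealthy instances itself, but two NodePool fields can keep stuck nodes from lingering indefinitely. A hedged sketch, assuming the `karpenter.sh/v1` API; the pool name and durations are placeholders to tune:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                  # hypothetical pool name
spec:
  template:
    spec:
      # Recycle every node after a fixed lifetime so no node, healthy or
      # stuck, outlives this window. A blunt mitigation, not a root-cause fix.
      expireAfter: 720h          # 30 days; shorten if nodes go Unknown often
      # Force node deletion if a graceful drain stalls, e.g. because the
      # kubelet is unreachable and pods never terminate cleanly.
      terminationGracePeriod: 30m
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

Separately, recent Karpenter releases (v1.1+) ship an alpha node auto-repair capability behind the `NodeRepair` feature gate on the controller, which deletes nodes whose `Ready` condition stays `False` or `Unknown` past a toleration window; check the docs for your version before relying on it.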

suraj2410 commented 2 weeks ago

We are also facing the same issue: nodes frequently go into the NotReady state and remain there.

mahaveer08 commented 2 weeks ago

I'm not exactly sure what the issue is or why the kubelet keeps stopping communication with the API server. Restarting the stopped instances temporarily fixes the problem, but I have too many nodes in an unknown state to restart them all manually, and the issue recurs too often to keep doing that by hand. I need urgent help with this. @ellistarn @jigisha620
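One common cause of "Kubelet stopped posting node status" on heavily loaded nodes is that workloads starve the kubelet and container runtime of CPU and memory. Reserving resources for system daemons is the standard mitigation. A sketch assuming the AWS provider's `karpenter.k8s.aws/v1` `EC2NodeClass`, where kubelet settings live as of Karpenter v1 (on `v1beta1` they sit under the NodePool's `spec.template.spec.kubelet`); all names, tags, and sizes are placeholders:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                        # hypothetical node class name
spec:
  amiSelectorTerms:
    - alias: al2023@latest             # placeholder AMI selection
  role: KarpenterNodeRole-my-cluster   # placeholder IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  kubelet:
    # Reserve headroom so busy workloads cannot starve the kubelet and
    # container runtime, a common cause of nodes going Unknown under load.
    systemReserved:
      cpu: 500m                        # placeholder; size to instance type
      memory: 1Gi
    kubeReserved:
      cpu: 500m
      memory: 1Gi
```

With reservations in place, the node's allocatable capacity shrinks accordingly, so pods can no longer consume the CPU that the node's own control processes need to keep reporting status.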

mahaveer08 commented 4 days ago

Can anyone help with this? My nodes frequently go into an unknown state, and the condition status shows "Kubelet stopped posting node status."