Open mahaveer08 opened 3 weeks ago
This issue is currently awaiting triage.
If Karpenter contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
These are the current node conditions:

```
MemoryPressure   Unknown   Kubelet stopped posting node status.
DiskPressure     Unknown   Kubelet stopped posting node status.
PIDPressure      Unknown   Kubelet stopped posting node status.
Ready            Unknown   Kubelet stopped posting node status.
```
What configuration should I add to handle or remove nodes stuck in the NotReady state, and what are the common reasons nodes enter that state?
Is there any configuration we can add so that NotReady nodes are automatically restarted or drained? Currently, these nodes get stuck and remain in that state indefinitely.
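Not a Karpenter-specific fix, but core Kubernetes taint-based eviction already controls how long pods stay bound to a NotReady/unreachable node (300 seconds by default for both taints). A minimal sketch of shortening that window per workload, added under a pod template's `spec` (the 60-second value is illustrative, not a recommendation):

```yaml
# Sketch: reschedule pods off NotReady/unreachable nodes faster than the
# default 300s. Goes under the pod spec; tolerationSeconds is illustrative.
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
```

This only moves the workloads; the stuck instance itself still needs to be terminated or repaired out of band.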
We are also facing the same issue: nodes frequently go into the NotReady state and remain there.
I'm not sure why the kubelet keeps losing communication with the control plane. Restarting the stopped instances temporarily fixes the problem, but I have too many nodes in an Unknown state to restart them all manually, and the issue recurs too often to keep fixing it by hand. I need urgent help with this @ellistarn @jigisha620
Can anyone help with this? My nodes often go into an Unknown state, with the condition status showing 'Kubelet stopped posting node status.'
Description
What problem are you trying to solve? I'm currently managing a large-scale deployment using Karpenter, and I am encountering two specific issues:
Nodes in Unknown State: Some of my nodes have intermittently transitioned to an unknown state. This is causing disruptions in workload scheduling and cluster stability. I need to understand potential reasons for this behavior and how to troubleshoot and resolve it.
Clarification on High CPU Usage and CPU Limit Settings: I have several high-CPU node pools, and I'm trying to understand how to optimize CPU utilization and properly set CPU limits. Specifically, I would like guidance on the following:
- How to prevent CPU overutilization when scaling large node pools (e.g., 14,000+ CPUs).
- Best practices for setting CPU limits in node pools to avoid resource exhaustion without underutilizing available capacity.

How important is this feature to you? These issues are critical to the stability and efficiency of our cluster. The unknown-node-state issue directly impacts the reliability of our services, while high CPU usage and misconfigured CPU limits can lead to performance bottlenecks. Any guidance or fixes would significantly improve the operation of our workloads at scale.
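For the CPU-limit question, Karpenter's NodePool resource supports a `spec.limits` block that caps the aggregate capacity the pool may provision; once total requested resources reach the cap, Karpenter stops launching new nodes for that pool. A minimal sketch (the pool name and limit values are illustrative, and the required `spec.template` with its requirements and `nodeClassRef` is elided):

```yaml
# Sketch: capping aggregate capacity on a Karpenter NodePool.
# Values are illustrative; the rest of the NodePool spec is elided.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-cpu        # hypothetical pool name
spec:
  limits:
    cpu: "14000"        # provisioning stops once the pool's total
    memory: 50000Gi     # requested CPU/memory reaches these caps
```

Note that limits gate new provisioning only; they do not deprovision existing nodes that already exceed the cap.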