Closed: wmgroot closed this issue 1 month ago
This issue is currently awaiting triage.
If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
For anyone experiencing similar symptoms without TerminationGracePeriod, check whether you might be hitting the VolumeAttachment problem described in this issue: https://github.com/kubernetes-sigs/karpenter/issues/1684
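As a quick way to check for that (generic kubectl queries, not taken from the original report; `<node-name>` and `<volumeattachment-name>` are placeholders):

```shell
# VolumeAttachments are cluster-scoped; the default output includes a NODE column,
# so grepping for the stuck node shows any attachments that never detached.
kubectl get volumeattachments | grep <node-name>

# Describe a specific attachment to see any attach/detach errors reported by the CSI driver.
kubectl describe volumeattachment <volumeattachment-name>
```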
Adding correspondence from Slack:
Two potential influencing factors here:
I'm guessing you're more impacted by #2 based on the events I see. This is likely a difference from v0.37: since we didn't have TGP, we didn't enqueue nodes for deletion that had do-not-disrupt/PDB-blocking pods in the first place, so the likelihood that you now have indefinitely draining nodes is higher. Not to mention that we also now block eviction on do-not-disrupt pods, so the average drain time might be longer than in v0.37.
If I had to guess, the best way to fix this would be for us to solve our preferences story (https://github.com/kubernetes-sigs/karpenter/issues/666) and add a PreferNoSchedule taint for drifted nodes. We discussed adding that taint, but we didn't have a consistent story for how consolidation, disruption, and provisioning could all align so that we don't get any flapping issues.
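For illustration only, the PreferNoSchedule idea amounts to something like the soft taint below applied to a drifted node, so new pods prefer other nodes without being hard-blocked; the taint key is made up for this sketch and is not something Karpenter applies today:

```shell
# Hypothetical soft taint on a drifted node: PreferNoSchedule discourages, but does not
# forbid, new pods from landing on it while it waits to be disrupted.
kubectl taint nodes <drifted-node-name> example.com/drifted=true:PreferNoSchedule

# Remove the taint again if the node ends up staying around.
kubectl taint nodes <drifted-node-name> example.com/drifted=true:PreferNoSchedule-
```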
More correspondence from Slack (some of which is already included above): Yeah, we should definitely dive into it; once again, these were just theories. Can you open the issue with the logs/events and check which pods are being considered here? There should be pod nomination events and node nomination events. It'd be interesting to see the following:
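A few kubectl queries along these lines can help surface those events; the event reasons used below are assumptions and may differ between Karpenter versions:

```shell
# Pod/node nomination events emitted by Karpenter (reason name is an assumption).
kubectl get events -A --field-selector reason=Nominated --sort-by=.lastTimestamp

# Events explaining why a node or nodeclaim cannot currently be disrupted.
kubectl get events -A --field-selector reason=DisruptionBlocked --sort-by=.lastTimestamp

# All recent events for the node that refuses to drain (<node-name> is a placeholder).
kubectl get events -A --field-selector involvedObject.name=<node-name> --sort-by=.lastTimestamp
```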
@wmgroot any thoughts here? Did you get a chance to validate what I was saying?
@cnmcavoy identified a bug in Karpenter's logic that tracks nodes marked for deletion. There's an error case which can fail to unmark a marked node, resulting in disruption budgets being reached while no progress can be made. We've got a patch that we've been testing for the last week and plan to open a PR for soon.
We think that TGP is not directly related to this problem, but was exacerbating the issue since nodes in a terminating state take up space in the disruption budget while they're pending termination.
Ultimately the bug was introduced in some of our patched code to address issues with single and multi-node consolidation. We plan to work further with the maintainers on improvements to consolidation to avoid the need to run a forked version of Karpenter.
After addressing the bug in our patch, we have seen our disruption frequencies and cluster scale return to pre-v1 levels. We'll re-open or create a new issue if we notice anything else amiss with TGP and drift disruption.
Description
Observed Behavior: NodeClaims enter a Drifted state but are never disrupted.
No VolumeAttachments are involved in this issue.
My node reports that disruption is blocked due to a pending pod, but I have no pending pods in my cluster, and the node in question is tainted so that only a single do-not-disrupt pod can schedule there as a test case.
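For reference, the test case described above looks roughly like the pod sketch below; `karpenter.sh/do-not-disrupt` is the real Karpenter annotation, while the taint key `example.com/disruption-test` and the node pinning are placeholders for this example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: do-not-disrupt-test
  annotations:
    # Blocks voluntary disruption of the node this pod is running on.
    karpenter.sh/do-not-disrupt: "true"
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-name>   # pin the test pod to the tainted node
  tolerations:
    # Placeholder taint that only this test pod tolerates.
    - key: example.com/disruption-test
      operator: Exists
      effect: NoSchedule
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```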
Expected Behavior: Nodes with a TerminationGracePeriod set that include do-not-disrupt or PDB-blocked pods are able to be disrupted due to NodeClaim drift and are eventually drained. A new NodeClaim is created immediately once disruption of the old NodeClaim begins.
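The setup being described corresponds roughly to a NodePool like the hedged sketch below; field names follow the Karpenter v1 API as I understand it, and all values (grace period, budget, node class) are placeholders rather than the reporter's actual configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example
spec:
  template:
    spec:
      # Upper bound on how long a draining node may be held up by do-not-disrupt/PDB-blocked
      # pods before Karpenter force-terminates it.
      terminationGracePeriod: 48h
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      # Nodes that are terminating but not yet gone count against this budget, which is
      # why stuck nodes can stall all further disruption.
      - nodes: "10%"
```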
Reproduction Steps (Please include YAML):
Versions:
- Chart Version: 1.0.1
- Kubernetes Version (`kubectl version`): 1.29

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.