Open snieg opened 2 months ago
Hi @snieg, you can achieve similar behavior using disruption-budgets-by-reason. What you want to achieve has also been discussed here. Please take a look at the docs to understand more about how this can be implemented.
Likewise -- if you only want to disable drift for specific applications, you can use the karpenter.sh/do-not-disrupt annotation, and Karpenter will not disrupt pods with this annotation until their node expires.
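As a minimal sketch of the annotation approach, the fragment below sets karpenter.sh/do-not-disrupt on a pod template. The StatefulSet name is a hypothetical placeholder and the manifest is deliberately incomplete (selector, serviceName, and containers are left out); only the annotation key comes from this thread.

```yaml
# Hypothetical StatefulSet fragment, not a complete manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solr                    # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Asks Karpenter not to voluntarily disrupt the node while this pod runs on it.
        karpenter.sh/do-not-disrupt: "true"
```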
> Hi @snieg, you can achieve similar behavior using disruption-budgets-by-reason. What you want to achieve has also been discussed here. Please take a look at the docs to understand more about how this can be implemented.
Unfortunately, it doesn't work.
My disruption configuration looks like this:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 5m
  budgets:
    - nodes: "100%"
      reasons:
        - "Empty"
        - "Underutilized"

As you can see, "Drifted" is not provided as a reason. After I changed the "expireAfter" parameter, all NodeClaims were recycled with the "drifted" reason:
{"level":"INFO","time":"2024-08-19T13:48:06.986Z","logger":"controller","message":"disrupting nodeclaim(s) via replace, terminating 1 nodes (15 pods) XXX.internal/m6a.xlarge/spot and replacing with node from types c6a.2xlarge","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"27b6b6d6-194b-4a2b-b9ef-1d6341cfb22a","command-id":"94b840bc-9f98-4ec6-8689-4541563c3f81","reason":"drifted"}
EDIT:
The configuration below is correct and does not allow recycling due to the "Drifted" reason:
budgets:
  - nodes: "0"
    reasons: [Drifted]
  - nodes: "100%"
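For context, this is roughly where those budgets sit in a v1 NodePool. It is a sketch under stated assumptions rather than a complete manifest: the name `default` and the `expireAfter` value are made up for illustration, and required fields such as nodeClassRef and requirements are left out.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                 # assumed name
spec:
  template:
    spec:
      expireAfter: 720h         # assumed value
      # nodeClassRef and requirements omitted for brevity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
      - nodes: "0"              # no node may be disrupted for the Drifted reason
        reasons:
          - Drifted
      - nodes: "100%"           # no reasons listed, so this budget covers every reason
```

Both budgets apply to Drifted here, and the more restrictive one (zero nodes) is the one that takes effect for that reason.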
This is indeed very dangerous when you upgrade to 1.0. At the time of our upgrade (0.37.3 -> 1.0.1) we were still using v1beta1 NodePools, which don't allow you to add a disruption budget for drift. Because Karpenter 1.0 dropped the global feature gate for drift, it will replace all drifted nodes upon restart until you apply your v1 NodePool with the correct budget configuration to disable drift. Before upgrading to v1, manually add the do-not-disrupt annotation to your nodes!
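The node-level annotation mentioned above would look roughly like this on a Node object. The node name is a placeholder, and this is a metadata fragment for illustration rather than something to apply as-is.

```yaml
# Hypothetical Node fragment showing only the relevant metadata.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.internal    # placeholder node name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
```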
Description
While reading the v1 migration guide, I came across this information:
This is VERY dangerous and we would like to have the ability to turn Drift off.
Why? We have a lot of StatefulSets, such as Solr and ScyllaDB, and uncontrolled recycling of nodes can cause a lot of problems and even data loss.
Additionally, we have many very large clusters that can scale up to 200 nodes during peak hours. We want to avoid uncontrolled node recycling just because we changed a single option (like adding a tag to NodePools).
We want to control this and apply changes only to new instances, while instances with the old configuration should be deleted in their normal life cycle. Therefore, we need the ability to turn Drift off.
We are in the process of migrating to Karpenter and have already migrated several clusters. With version v1.0 and Drift enabled by default, we can't go further (too much risk with this option).