kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0
Ability to disable Drift in v1.0 #1576

Open snieg opened 2 months ago

snieg commented 2 months ago

Description

While reading the v1 migration guide, I came across this information:

* FEATURE_GATES.DRIFT=true was dropped and promoted to Stable, and cannot be disabled.
    * Users currently opting out of drift, disabling the drift feature flag will no longer be able to do so.

This is VERY dangerous and we would like to have the ability to turn Drift off.

Why? We run many StatefulSets, such as Solr and ScyllaDB, and uncontrolled recycling of nodes can cause a lot of problems and even data loss.

Additionally, we have many very large clusters that can scale up to 200 nodes during peak hours. We want to avoid uncontrolled node recycling triggered by changing a single option (such as adding a tag to NodePools).

We want to control this and apply changes only to new instances; instances with the old configuration should be deleted in their normal life cycle. Therefore, we need the ability to turn Drift off.

We are in the process of migrating to Karpenter and have already migrated several clusters. With v1.0 and Drift enabled by default, we can't go further (this option carries too much risk).

k8s-ci-robot commented 2 months ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

jigisha620 commented 2 months ago

Hi @snieg, you can orchestrate similar behavior using disruption-budgets-by-reason. What you want to achieve has also been discussed here. Please take a look at the docs to understand more about how this can be implemented.

jonathan-innis commented 2 months ago

Likewise -- if you only want to disable drift for specific applications, you can use the karpenter.sh/do-not-disrupt annotation, and Karpenter will not disrupt pods with this annotation until their node expiration.
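
For illustration, a minimal sketch of where this annotation could go for a StatefulSet such as the ones mentioned above. The workload name, labels, and image are hypothetical placeholders; only the annotation key comes from this thread. It is set on the pod template so that every pod created by the StatefulSet carries it.

```yaml
# Sketch only: names and image are placeholders, not taken from this issue.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db                 # hypothetical workload name
spec:
  serviceName: example-db
  replicas: 3
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
      annotations:
        # Asks Karpenter not to voluntarily disrupt the node running this pod.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
      - name: db
        image: example.invalid/db:latest   # placeholder image
```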

snieg commented 2 months ago

> Hi @snieg, you can orchestrate similar behavior using disruption-budgets-by-reason. What you want to achieve has also been discussed here. Please take a look at the docs to understand more about how this can be implemented.

Unfortunately, it doesn't work.

My disruption configuration looks like this:

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
    - nodes: "100%"
      reasons:
      - "Empty"
      - "Underutilized"

As you can see, "Drifted" is not listed as a reason. After I changed the "ExpireAfter" parameter, all NodeClaims were recycled with the reason "drifted":

{"level":"INFO","time":"2024-08-19T13:48:06.986Z","logger":"controller","message":"disrupting nodeclaim(s) via replace, terminating 1 nodes (15 pods) XXX.internal/m6a.xlarge/spot and replacing with node from types c6a.2xlarge","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"27b6b6d6-194b-4a2b-b9ef-1d6341cfb22a","command-id":"94b840bc-9f98-4ec6-8689-4541563c3f81","reason":"drifted"}

EDIT:

The configuration below is correct and does not allow recycling due to drift:

    budgets:
    - nodes: "0"
      reasons: [Drifted]
    - nodes: "100%"
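
To show where these budgets sit, here is a sketch of a v1 NodePool using them. The NodePool name and the expireAfter value are hypothetical, and nodeClassRef and requirements are omitted, so this is a fragment rather than a complete manifest. The reading assumed here is that a budget applies only to the reasons it lists, that a budget without reasons applies to all of them, and that the most restrictive matching budget wins. Under that reading "Drifted" is capped at 0 nodes, whereas the earlier configuration had no budget matching "Drifted" at all.

```yaml
# Sketch only: metadata.name and expireAfter are placeholder values.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                    # hypothetical name
spec:
  template:
    spec:
      expireAfter: 720h            # placeholder value
      # nodeClassRef and requirements omitted for brevity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
    - nodes: "0"                   # zero nodes may be disrupted for drift
      reasons: [Drifted]
    - nodes: "100%"                # no reasons listed: applies to every reason
```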

gitarns commented 1 month ago

This is very dangerous indeed when you upgrade to 1.0. At the time of our upgrade (0.37.3 -> 1.0.1) we were still on v1beta1 NodePools, which don't allow you to add a disruption budget for Drift. Because Karpenter 1.0 dropped the global feature gate for drift, upon restart it will replace all drifted nodes until you apply your v1 NodePool with the correct budget configuration to disable drift. Before upgrading to v1, manually add the do-not-disrupt annotation to your nodes!
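
As a sketch of what an annotated node would look like, assuming the annotation meant here is the karpenter.sh/do-not-disrupt key mentioned earlier in the thread (the node name is a placeholder):

```yaml
# Sketch only: shows where the annotation lives on a Node object.
apiVersion: v1
kind: Node
metadata:
  name: example-node               # hypothetical node name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
```

Nodes created later would not carry this annotation unless it is also added to them.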