kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter behaviour with Disruption Budget #1350

Closed thangle-grabtaxi closed 2 months ago

thangle-grabtaxi commented 3 months ago

Description

Observed Behavior: When we make changes to a Karpenter node pool, specifically an AMI change, with a disruption budget of 100%, we expect Karpenter to rotate all nodes at once due to Drift. However, it only rotates 1 to 4 nodes at a time.

The disruption budget is set to 100%, which works out to around 35-40 nodes.


During the allowed disruption period of 10 minutes, Karpenter rotates the instances in sequence, 1 to 4 nodes per batch.


This causes frequent restarts for our applications. We set a 100% disruption budget within a short time frame (10 mins) specifically so that all nodes would be rotated together, meaning only one restart for our applications.

Expected Behavior: Karpenter rotates all nodes at once.

Reproduction Steps (Please include YAML):

    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: <NODE_POOL_NAME>
    spec:
      template:
        metadata:
          labels:
            node_group_name: <NODE_POOL_NAME>
        spec:
          nodeClassRef:
            name: <NODE_CLASS_NAME>
          requirements:
            - key: "karpenter.k8s.aws/instance-hypervisor"
              operator: In
              values: ["nitro"]
            - key: "karpenter.sh/capacity-type"
              operator: In
              values: ["on-demand"]
            - key: kubernetes.io/os
              operator: NotIn
              values: ["windows"]
            - key: kubernetes.io/arch
              operator: In
              values: ["arm64", "amd64"]
            - key: "karpenter.k8s.aws/instance-generation"
              operator: Gt
              values: ["3"]
      disruption:
        consolidationPolicy: WhenUnderutilized
        expireAfter: Never
        budgets:
          # Disruption Window: 10 minutes from 3pm to 3.10pm SGT (7am to 7.10am UTC)
          # Disruption Impact: 100%
          # Non Disruption Window: From 3.10pm to 3pm SGT (7.10am to 7am UTC)
          - nodes: "0"
            schedule: "10 7 * * *"
            duration: "23h50m"
          - nodes: "100%"
      limits:
        cpu: "10000"
        memory: 10000Gi

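For clarity on how the two budgets above interact: the `nodes: "0"` budget blocks voluntary disruption for 23h50m starting at 07:10 UTC, leaving a daily 10-minute window (07:00 to 07:10 UTC) in which only the `nodes: "100%"` budget applies. A minimal sketch of that logic (a hypothetical helper for this specific NodePool, not Karpenter code):

```python
from datetime import datetime, time, timedelta

# Hypothetical helper: returns the effective disruption budget ("0" or "100%")
# for the NodePool above, assuming schedule "10 7 * * *" with a 23h50m duration.
def active_budget(now_utc: datetime) -> str:
    # Start of the most recent blocking window (today's or yesterday's 07:10 UTC)
    start = now_utc.replace(hour=7, minute=10, second=0, microsecond=0)
    if now_utc < start:
        start -= timedelta(days=1)
    block_len = timedelta(hours=23, minutes=50)
    if start <= now_utc < start + block_len:
        return "0"        # scheduled budget active: no voluntary disruption
    return "100%"         # 07:00-07:10 UTC: only the unscheduled budget applies
```

Times inside the 23h50m span return `"0"`, and the remaining 10 minutes each day return `"100%"`, matching the comments in the YAML above.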
Versions:

k8s-ci-robot commented 3 months ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
njtran commented 3 months ago

A disruption budget prescribes the maximum amount of disruption allowed, but there are other safeguards in place while drifting nodes. We ensure that replacement nodes are online and healthy before disrupting drifted nodes, which naturally limits the total number of nodes that can be disrupted at once. I'd recommend using the observed rate of disruption as a way to understand how long you'd actually want your budgets to be at 100% to get a full roll of your nodes.
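The effect described here can be illustrated with a toy model (illustrative only, not Karpenter's implementation): the budget caps how many nodes *may* be disrupted concurrently, but since each drained node first needs a Ready replacement, throughput is bounded by node startup time and batch size rather than by the budget percentage:

```python
# Toy model of a drift rollout. All parameters are assumptions for
# illustration; batch_size stands in for Karpenter's other safeguards.
def nodes_rolled(total_nodes: int, budget_pct: int,
                 window_min: int, replace_min: int,
                 batch_size: int) -> int:
    budget = max(1, total_nodes * budget_pct // 100)  # max concurrent disruptions
    concurrency = min(budget, batch_size)             # safeguards cap the batch
    cycles = window_min // replace_min                # replace cycles per window
    return min(total_nodes, concurrency * cycles)

# e.g. 38 nodes, 100% budget, a 10-minute window, ~3 min for a replacement to
# become Ready, batches of 4 (as observed) -> only ~12 nodes roll per window.
```

Under these assumed numbers, a full roll of 38 nodes would need roughly a 30-minute window at the observed batch rate, which is the kind of back-of-the-envelope sizing suggested above.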

github-actions[bot] commented 2 months ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.