kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

ConsolidationPolicy: WhenEmpty #1647

Open jderieg opened 1 week ago

jderieg commented 1 week ago

Description

I originally posted this in Discussions, but it got no traction there, so posting it here. I think it may be a bug because it definitely does not behave as expected.

Observed Behavior: I've been testing the WhenEmpty policy, but it does not seem to be behaving as expected if the consolidateAfter setting is any more than about 2 to 3 minutes. My disruption settings look like this:

    disruption:
      consolidationPolicy: WhenEmpty
      consolidateAfter: 10m
      expireAfter: 360h

As a test, I scale up a deployment to a large number of pods in my nodegroup so that Karpenter spins up a new node. That works fine. When I scale the deployment back down to 0, I would expect Karpenter to scale down (remove) the Karpenter node after 10m of that deployment no longer needing it. That never happens. I've let it sit for over 24 hours and the node is never removed, even though there aren't any more workloads scheduled onto it to keep it alive. The strange thing is that if I set the consolidateAfter value to 2m or under, it works as I would expect and removes the node. I'm running Karpenter v0.37.

Expected Behavior: Consolidate the node(s) after the time specified in 'consolidateAfter'

Reproduction Steps (Please include YAML):

    disruption:
      consolidationPolicy: WhenEmpty
      consolidateAfter: 10m
      expireAfter: 360h
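
For context, a minimal NodePool sketch showing where this disruption block sits (the metadata name and nodeClassRef here are hypothetical placeholders, not from the original report; v0.37 uses the v1beta1 API, where expireAfter still lives under disruption):

    apiVersion: karpenter.sh/v1beta1   # API version shipped with v0.37
    kind: NodePool
    metadata:
      name: example                    # hypothetical name
    spec:
      template:
        spec:
          nodeClassRef:                # hypothetical node class reference
            name: default
      disruption:
        consolidationPolicy: WhenEmpty
        consolidateAfter: 10m
        expireAfter: 360h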

Versions:

leoryu commented 1 week ago

What do your budgets look like? Please set the budgets to 100% to make sure all your nodes could be consolidated:

    disruption:
      budgets:
      - nodes: 100%
      consolidateAfter: 10m
      consolidationPolicy: WhenEmptyOrUnderutilized

jonathan-innis commented 2 days ago

That never happens. I've let it sit for over 24 hours and it never removes the node, even though there aren't anymore workloads added to the node to keep it alive

Can you share the spec/status of the node when it was left around for 24h? There are a couple of fields, lastPodEventTime and the conditions block, that should give us a little more info. Karpenter will add a Consolidatable status condition after the node has surpassed its consolidateAfter. If that doesn't get added, that means the lastPodEventTime is too recent.

If that's not the behavior and the lastPodEventTime has truly surpassed your consolidateAfter, then yeah, that definitely seems like a bug.
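
For reference, those fields can be pulled with kubectl, assuming cluster access and a Karpenter version whose NodeClaim status exposes lastPodEventTime; the claim name below is a placeholder:

```shell
# List NodeClaims and their basic status (requires access to the cluster)
kubectl get nodeclaims

# Dump the full spec/status of the stuck node's claim; look for
# status.lastPodEventTime and the Consolidatable entry in status.conditions
kubectl get nodeclaim <claim-name> -o yaml

# Or extract just the relevant bits with jsonpath
kubectl get nodeclaim <claim-name> \
  -o jsonpath='{.status.lastPodEventTime}{"\n"}{.status.conditions}'
```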

jonathan-innis commented 2 days ago

/triage accepted

jonathan-innis commented 2 days ago

/triage needs-information