domgoodwin opened 9 months ago
@domgoodwin do you have any updates here? Can you provide the information asked above?
Can you provide:
- Karpenter version
- Karpenter controller cpu/memory config
- full controller logs
- size of cluster in terms of pods and nodes both before and after scaling
0.32.2
We don't enforce CPU limits and give it 8 cores in its request; we don't see any signs it's CPU-starved in the metrics.
Memory 6gb request, 12gb limit
Cluster size drops by about 3k pods from peak, from 16k to 13k. Most of our resizing is scaling down the CPU requests per pod which causes the large drop.
In terms of workers, at the largest peak we scale down from 230 worker nodes; these all have at least 16 vCPUs but vary across 5 instance sizes and 3 instance types.
I'll need to get logs next week, I'm afraid, but mostly we see a linear scale-down where Karpenter seems to be doing a single node at a time and struggles when we scale down by such volumes.
Checking in again. Any update on logs here?
Sorry, trying to filter the logs proved a bit harder than expected.
An update on this: as a legacy of migrating from an older cluster to EKS, our pods had both pod topology spread constraints and anti-affinities set up. Seemingly because of this, the scheduler was taking a long time to simulate during node consolidation and then timing out.
Dropping antiAffinities from our pods had this effect on scheduling/simulation:
See here for a visualisation:
Even with this change we seem to be on the borderline of timing out and seeing log lines like:

`abandoning single-node consolidation due to timeout after evaluating 163 candidates`
Previously it said <60 candidates, so that's progress, but when our cluster is at peak (17k pods) we're hitting the hard-coded 3m timeout.
I've raised this PR to configure the timeout so we can have it take longer for a cluster our size. https://github.com/kubernetes-sigs/karpenter/pull/1031
One thing I might explore is whether this can be run concurrently too, as I've noticed Karpenter only uses up to 2 CPU cores regardless of the CPU it's given.
I wonder why multi-node consolidation didn't kick in? It feels like it would take way too long for single-node consolidation to finish, which would block the disruption controller.
I am still reading into Karpenter's consolidation logic so could be wrong here.
First, https://github.com/kubernetes-sigs/karpenter/blob/2d8a616a751ad1ed46fd43f0c3ad6fd8f68fe6c1/pkg/controllers/disruption/singlenodeconsolidation.go#L58 is wrong I am pretty sure... there is no binary search logic, nor does it make sense that single-node consolidation is killing more than one node.
Second,
Maybe if Karpenter could consolidate in parallel, or not wait as long between consolidations, this would be improved?
I have been wondering (once again, could be wrong) why we evaluate all candidates in a single loop. If we can group nodepools together where their requirements overlap, then all pods running on nodes created by that nodepool group would, in theory, only ever be scheduled on nodes from the same group. That would let us partition the candidate set, which would allow parallel processing of candidates.
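A rough sketch of that partitioning idea, assuming a pre-computed `nodePoolGroup` key shared by nodepools whose requirements overlap (the types, names, and grouping key are all hypothetical, not Karpenter code):

```go
package main

import (
	"fmt"
	"sync"
)

// candidate is a node being considered for consolidation. nodePoolGroup is a
// hypothetical pre-computed key shared by nodepools with overlapping requirements.
type candidate struct {
	name          string
	nodePoolGroup string
}

// partition splits the candidate set by nodepool group. Under the assumption
// above, pods on nodes in one group can only reschedule within that group,
// so each partition can be simulated independently.
func partition(cands []candidate) map[string][]candidate {
	groups := map[string][]candidate{}
	for _, c := range cands {
		groups[c.nodePoolGroup] = append(groups[c.nodePoolGroup], c)
	}
	return groups
}

// evaluateGroups simulates each partition in its own goroutine, collecting
// the names of nodes the caller-supplied evaluation deems consolidatable.
func evaluateGroups(groups map[string][]candidate, eval func(candidate) bool) []string {
	var (
		mu             sync.Mutex
		wg             sync.WaitGroup
		consolidatable []string
	)
	for _, group := range groups {
		wg.Add(1)
		go func(group []candidate) {
			defer wg.Done()
			for _, c := range group {
				if eval(c) {
					mu.Lock()
					consolidatable = append(consolidatable, c.name)
					mu.Unlock()
				}
			}
		}(group)
	}
	wg.Wait()
	return consolidatable
}

func main() {
	cands := []candidate{
		{"node-a", "general"}, {"node-b", "general"}, {"node-c", "gpu"},
	}
	names := evaluateGroups(partition(cands), func(c candidate) bool {
		return c.nodePoolGroup == "general" // toy evaluation
	})
	fmt.Println(len(names), "consolidatable") // prints "2 consolidatable"
}
```

The hard part, of course, is computing the groups: two nodepools belong together whenever their requirements could satisfy the same pod, so in the worst case (fully overlapping nodepools) everything collapses back into one partition and nothing runs in parallel.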
https://github.com/kubernetes-sigs/karpenter/pull/1031 seems like a band-aid; I would prefer we take the approach of improving performance rather than just setting a longer timeout.
I think #1031 won't fix the issue. As mentioned, I don't think relying on single-node consolidation makes sense in this situation; however, I did call out low-hanging fruit we can tackle to improve deprovisioning performance: https://github.com/kubernetes-sigs/karpenter/issues/370#issuecomment-1962179758
One thing we have noticed is that all the processing seemingly happens in a couple of routines; our Karpenter CPU usage doesn't go above 2.5 cores.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
Description
What problem are you trying to solve? We commonly have large scale-out events where we might add 2-5k cores to our cluster and then reverse the scaling once the event has passed. When this happens, Karpenter can take 6-12 hours (depending on the size of the drop) to consolidate the workers back down to a normal level of unused capacity. This means we have a significant amount of unused compute while it scales down, and this happens multiple times a week.
In our other, older cluster, which uses the cluster-autoscaler, the scale-in of workers happens relatively quickly: within a couple of hours we see the nodes drained and terminated. We configure the cluster-autoscaler's scale-down threshold to 80%, so any node with request utilisation less than that is drained and terminated.
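The cluster-autoscaler-style check described above amounts to comparing summed pod requests against the node's allocatable capacity. A minimal Go sketch (function names and shapes are mine, not cluster-autoscaler's API):

```go
package main

import "fmt"

// nodeUtilisation returns the fraction of a node's allocatable CPU claimed
// by pod requests, both in millicores (e.g. 16 vCPUs = 16000m).
func nodeUtilisation(podRequestsMilli []int64, allocatableMilli int64) float64 {
	var sum int64
	for _, r := range podRequestsMilli {
		sum += r
	}
	return float64(sum) / float64(allocatableMilli)
}

// scaleDownEligible mirrors the behaviour described above: any node whose
// request utilisation falls below the threshold (80% here) is a candidate
// to be drained and terminated.
func scaleDownEligible(podRequestsMilli []int64, allocatableMilli int64, threshold float64) bool {
	return nodeUtilisation(podRequestsMilli, allocatableMilli) < threshold
}

func main() {
	pods := []int64{2000, 1500, 500} // 4000m requested on a 16-core node
	fmt.Println(scaleDownEligible(pods, 16000, 0.80)) // prints "true"
}
```

The appeal of this check is that it is per-node and O(pods on node), so it scales with cluster size, whereas a full scheduling simulation per candidate does not.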
I'm not certain of the best way to address this. It seems like node consolidation is happening one node at a time, as we see a pretty linear drop in our worker node capacity. Maybe if Karpenter could consolidate in parallel, or not wait as long between consolidations, this would be improved?
How important is this feature to you? Relatively important given the amount of wasted compute. We're exploring replicating the cluster-autoscaler behaviour, where any node below a utilisation percentage is drained and deleted, but need to see how this might interact with Karpenter's behaviour.