domgoodwin opened 9 months ago
@domgoodwin do you have any updates here? Can you provide the information asked above?
Can you provide:
- Karpenter version
- Karpenter controller cpu/memory config
- full controller logs
- size of cluster in terms of pods and nodes both before and after scaling
0.32.2
We don't enforce CPU limits and give it 8 cores in its request; we don't see any signs it's CPU-starved in the metrics.
Memory 6gb request, 12gb limit
Cluster size drops by about 3k pods from peak, from 16k to 13k. Most of our resizing is scaling down the CPU requests per pod which causes the large drop.
In terms of workers, at the largest peak we scale down from 230 worker nodes; these all have at least 16 vCPUs but vary across 5 instance sizes and 3 instance types.
I'll need to get logs next week, I'm afraid, but mostly we see a linear scale-down where Karpenter seems to be doing a single node at a time and struggles when we scale down by such volumes.
Checking in again. Any update on logs here?
Sorry, trying to filter the logs proved a bit harder than expected.
An update on this: as a legacy of migrating from an older cluster to EKS, our pods had both pod topology spread constraints and anti-affinities set up. Seemingly because of this, the scheduler was taking a long time to simulate during node consolidation and then timing out.
Dropping antiAffinities from our pods had this effect on scheduling/simulation:
See here for a visualisation:
Even with this change we seem to be on the borderline of timing out and seeing log lines like:

`abandoning single-node consolidation due to timeout after evaluating 163 candidates`
Previously it said <60 candidates, so that's progress, but when our cluster is at peak (17k pods) we're hitting the hard-coded 3m timeout.
I've raised this PR to configure the timeout so we can have it take longer for a cluster our size. https://github.com/kubernetes-sigs/karpenter/pull/1031
One thing I might explore is whether this can be run concurrently too, as I've noticed Karpenter only uses up to 2 CPU cores regardless of the CPU it's given.
I wonder why multi-node consolidation didn't kick in? It feels like it would take way too long for single-node consolidation to finish, which would block the disruption controller.
I am still reading into Karpenter's consolidation logic so could be wrong here.
First, https://github.com/kubernetes-sigs/karpenter/blob/2d8a616a751ad1ed46fd43f0c3ad6fd8f68fe6c1/pkg/controllers/disruption/singlenodeconsolidation.go#L58 is wrong I am pretty sure... there is no binary search logic, nor does it make sense that single-node consolidation is killing more than one node.
Second,
Maybe if Karpenter could consolidate in parallel, or not wait as long between consolidations, this would be improved?
I have been wondering (once again, could be wrong) why we evaluate all candidates in a single loop. If we can group nodepools together where their requirements overlap, then all pods running on nodes created by that nodepool group would, in theory, only ever be scheduled on nodes from the same group. That would let us partition the candidate set, which would allow parallel processing of candidates.
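A rough sketch of that partitioning idea, assuming a pre-computed `nodePoolGroup` key shared by nodepools whose requirements overlap (the types, names, and grouping key are all hypothetical, not Karpenter code):

```go
package main

import (
	"fmt"
	"sync"
)

// candidate is a node being considered for consolidation. nodePoolGroup is a
// hypothetical pre-computed key shared by nodepools with overlapping requirements.
type candidate struct {
	name          string
	nodePoolGroup string
}

// partition splits the candidate set by nodepool group. Under the assumption
// above, pods on nodes in one group can only reschedule within that group,
// so each partition can be simulated independently.
func partition(cands []candidate) map[string][]candidate {
	groups := map[string][]candidate{}
	for _, c := range cands {
		groups[c.nodePoolGroup] = append(groups[c.nodePoolGroup], c)
	}
	return groups
}

// evaluateGroups simulates each partition in its own goroutine, collecting
// the names of nodes the caller-supplied evaluation deems consolidatable.
func evaluateGroups(groups map[string][]candidate, eval func(candidate) bool) []string {
	var (
		mu             sync.Mutex
		wg             sync.WaitGroup
		consolidatable []string
	)
	for _, group := range groups {
		wg.Add(1)
		go func(group []candidate) {
			defer wg.Done()
			for _, c := range group {
				if eval(c) {
					mu.Lock()
					consolidatable = append(consolidatable, c.name)
					mu.Unlock()
				}
			}
		}(group)
	}
	wg.Wait()
	return consolidatable
}

func main() {
	cands := []candidate{
		{"node-a", "general"}, {"node-b", "general"}, {"node-c", "gpu"},
	}
	names := evaluateGroups(partition(cands), func(c candidate) bool {
		return c.nodePoolGroup == "general" // toy evaluation
	})
	fmt.Println(len(names), "consolidatable") // prints "2 consolidatable"
}
```

The hard part, of course, is computing the groups: two nodepools belong together whenever their requirements could satisfy the same pod, so in the worst case (fully overlapping nodepools) everything collapses back into one partition and nothing runs in parallel.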
https://github.com/kubernetes-sigs/karpenter/pull/1031 seems like a band-aid; I would prefer we take the approach of improving performance rather than just setting a longer timeout.
I think #1031 won't fix the issue. As mentioned, I don't think relying on single-node consolidation makes sense in this situation; however, I did call out low-hanging fruit we can tackle to improve deprovisioning performance: https://github.com/kubernetes-sigs/karpenter/issues/370#issuecomment-1962179758
One thing we have noticed is that all the processing seemingly happens in a couple of routines; our Karpenter CPU usage doesn't go above 2.5 cores.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
Description
What problem are you trying to solve? We commonly have large scale-out events where we might add 2-5k cores to our cluster and then reverse the scaling once the event has passed. When this happens, Karpenter can take 6-12 hours (depending on the size of the drop) to consolidate the workers back down to a normal level of unused capacity. This means we have a significant amount of unused compute while it scales down, and this happens multiple times a week.
In our other, older cluster, which uses the cluster-autoscaler, the scale-in of workers happens relatively quickly: within a couple of hours we see the nodes drained and terminated. We configure the cluster-autoscaler's scale-down threshold to 80%, so any node with request utilisation less than that is drained and terminated.
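The cluster-autoscaler-style check described above amounts to comparing summed pod requests against the node's allocatable capacity. A minimal Go sketch (function names and shapes are mine, not cluster-autoscaler's API):

```go
package main

import "fmt"

// nodeUtilisation returns the fraction of a node's allocatable CPU claimed
// by pod requests, both in millicores (e.g. 16 vCPUs = 16000m).
func nodeUtilisation(podRequestsMilli []int64, allocatableMilli int64) float64 {
	var sum int64
	for _, r := range podRequestsMilli {
		sum += r
	}
	return float64(sum) / float64(allocatableMilli)
}

// scaleDownEligible mirrors the behaviour described above: any node whose
// request utilisation falls below the threshold (80% here) is a candidate
// to be drained and terminated.
func scaleDownEligible(podRequestsMilli []int64, allocatableMilli int64, threshold float64) bool {
	return nodeUtilisation(podRequestsMilli, allocatableMilli) < threshold
}

func main() {
	pods := []int64{2000, 1500, 500} // 4000m requested on a 16-core node
	fmt.Println(scaleDownEligible(pods, 16000, 0.80)) // prints "true"
}
```

The appeal of this check is that it is per-node and O(pods on node), so it scales with cluster size, whereas a full scheduling simulation per candidate does not.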
I'm not certain of the best way to address this. It seems like node consolidation is happening one node at a time, as we see a pretty linear drop in our worker node capacity. Maybe if Karpenter could consolidate in parallel, or not wait as long between consolidations, this would be improved?
How important is this feature to you? Relatively important given the amount of wasted compute. We're exploring replicating the cluster-autoscaler behaviour, where any node below a utilisation percentage is drained and deleted, but need to see how this might interact with Karpenter's behaviour.