ellistarn opened this issue 2 years ago (Open)
We must have a way to run a minimum number of nodes in a nodepool.
You can already do that (run low-priority placeholder Pods), but AFAIK there's no controller that does exactly this. Maybe I'll put some time in and try to write one.
Same case here: we need a minimum set of nodes, evenly spread among the AZs (AWS). We always want extra capacity available for workload peaks, and most of the time we can't wait for the spin-up/down dance.
@sftim not sure but maybe the cluster autoscaler provides something similar? About the low-prio placeholder Pods, have you seen any good guide to do it? Sounds like a hack tho.
Cluster Autoscaler documents how to overprovision cluster to offset node provisioning by running preemptable pods. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler
At the same time, this solution isn't exclusive to the Cluster Autoscaler; it works just fine with Karpenter and any other potential autoscaler. I wouldn't consider it a hack, as it's implemented via stable Kubernetes resources using common practices.
There is a ready-to-use Helm chart at https://github.com/deliveryhero/helm-charts/tree/master/stable/cluster-overprovisioner.
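For anyone who wants a concrete starting point without the Helm chart, a minimal sketch of that pattern could look like this (the priority value, replica count, and resource requests are placeholders you would tune to the headroom you want):

```yaml
# A low-priority PriorityClass so the placeholder pods are preempted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods that reserve headroom capacity"
---
# Placeholder pods that do nothing but hold resources on warm nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                       # number of headroom "slots" to keep warm
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"              # size each placeholder to the per-replica headroom you want
              memory: 1Gi
```

When a real pod with default (higher) priority arrives, the scheduler preempts a placeholder, the workload starts immediately, and the displaced placeholder goes pending, which prompts the autoscaler to bring up a replacement node.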
Cluster Autoscaler now also supports ProvisioningRequest CRD
There is an additional financial impact caused by the de-facto solution of keeping a warm pool of capacity through overprovisioning. When high-priority pods are scheduled and preempt the overprovisioning workloads, Karpenter will immediately scale the node group in order to regain capacity and reschedule those overprovisioning workloads. In some cases it may be preferable for the headroom to be elastic, such that the desired headroom is only restored after a deployment or batch job completes. This could be accomplished by setting a desired range of overprovisioned resources. But I also believe @sftim's minimumPodPriority suggestion is an acceptable solution.
Cluster Autoscaler now also supports ProvisioningRequest CRD
Does Karpenter have a plan to implement this? It would be really helpful for AI workloads.
+1
Cross ref-ing the ProvisioningRequest ask here: https://github.com/kubernetes-sigs/karpenter/issues/742#issuecomment-2122005473
cc @raywainman who is tracking warm replicas stories on behalf of WG Serving here:
https://docs.google.com/document/d/1QsN4ubjerEqo5L4bQamOFFS2lmCv5zNPis2Z8gcIITg
For everyone's context, we did a little bit of ideating and came up with an API that we were pretty happy with from the Karpenter side (see https://github.com/jonathan-innis/karpenter-headroom-poc). We're having an open discussion with the CAS folks about the differences between how we are thinking about the Headroom API and the ProvisioningRequest API; feel free to take a look and comment on the doc if you have any thoughts: https://docs.google.com/document/d/1SyqStWUt407Rcwdtv25yG6MpHdNnbfB3KPmc4zQuz1M/edit?usp=sharing
Hi, how's it going? I have been looking for a fixed-provisioning solution with Karpenter.
We need a minimum number of on-demand nodes (I'd say 2, one per AZ) and then scale the rest of the workload up/down with spot instances. I haven't found a solution yet; we tried to play with the "On-Demand/Spot Ratio Split" approach, but it didn't work and kept spreading spot instances across the whole workload.
Any workarounds or thoughts on how to solve this? We really want to fully use Karpenter for our workload.
Hi, @balmha
Based on Karpenter, we developed a feature that ensures a minimum number of non-spot replicas for each workload.
Under the hood, it's a webhook component that monitors the distribution of each workload and modifies the pods' affinity to prefer spot instances while requiring some replicas to run on-demand. You can check it out here: CloudPilot Console.
This is not a promotion, just a technical communication.
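Purely as an illustration of the general pattern described above (not their actual implementation): a mutating webhook could inject a soft spot preference into most replicas, roughly like the fragment below, and inject a hard requiredDuringSchedulingIgnoredDuringExecution term on karpenter.sh/capacity-type In ["on-demand"] into the replicas that must stay stable.

```yaml
# Hypothetical pod-spec fragment: soft preference for spot capacity on most replicas.
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
```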
We need a minimum number of on-demand nodes (I'd say 2, one per AZ) and then scale the rest of the workload up/down with spot instances. I haven't found a solution yet; we tried to play with the "On-Demand/Spot Ratio Split" approach, but it didn't work and kept spreading spot instances across the whole workload.
You don't need this issue resolved for your use case @balmha.
Use two NodePools:
- A NodePool with the requirement karpenter.sh/capacity-type set to ["on-demand"] and a higher weight. Set limits on it to control how much on-demand "base capacity" you need.
- A NodePool with the requirement karpenter.sh/capacity-type set to ["spot", "on-demand"] (it will schedule spot nodes and only fall back to on-demand in case of availability issues) and a lower weight.

Even better, you can let the workloads that can't tolerate interruptions set nodeSelectors or affinity so they run on on-demand nodes. That's something that is cleaner to do with Karpenter than with CA. A rough sketch of both NodePools follows below.
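In case it helps, here is a rough sketch of that two-NodePool setup. The names, the CPU limit, and the EC2NodeClass reference are placeholders, and the exact schema depends on your Karpenter version (this sketch is written against the v1 API):

```yaml
# Base capacity: on-demand only, preferred via higher weight, capped via limits.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-base
spec:
  weight: 100                      # considered before lower-weight NodePools
  limits:
    cpu: "16"                      # caps how much on-demand "base capacity" can exist
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # assumes the AWS provider and an EC2NodeClass named "default"
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Everything else: spot first, falling back to on-demand on availability issues.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  weight: 10                       # used once the base pool has reached its limits
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```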
@jonathan-innis I really like the look of the headroom APIs, I think they'd cover the requirements I was talking about in #993. Is there a separate issue to track the headroom APIs? Google docs are blocked from our corporate machines.
Is this still planned to be implemented? Any kind of ETA?
Our use case would also greatly benefit from it: we scale up GitLab Runners for our organization. However, the cold starts (60-90 seconds) are not ideal for CI/CD. Having 1-2 nodes always available as headroom would make sure that every incoming pipeline starts immediately. Ideally, the minimum number of nodes could be set for a specific schedule (e.g. office hours only) so that during nights or weekends it ramps down to zero to minimize costs. (Or is there any other approach I can use to fulfill my use case with e.g. EKS + Karpenter?)
Our team uses blue/green deployments and always has to wait a few minutes for new nodes to come up, after which Karpenter disrupts nodes again (right after a deployment there is a temporary overuse of resources). It would be great if we could configure a CPU or memory buffer. For example, if we constantly use 1000m CPU, a 20% buffer would keep 1200m available for deployments and running cron jobs.
One suggestion for always having a fixed set of nodes available is a Deployment with proper topologySpreadConstraints + a PDB to force one pod (and therefore one node) per replica; set very low resource requests on it.
Don't set a lower priority, so it won't get preempted (unlike the pre-warm/headroom scenario mentioned before).
If you combine this with KEDA you can even do that dynamically based on external factors, e.g. a cron schedule. A rough sketch follows below.
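Something like the following, with placeholder names and sizes (the pause containers exist only to hold one node per replica; tune replicas to the number of warm nodes you want, and note that a required pod anti-affinity on kubernetes.io/hostname is an alternative way to strictly enforce one pod per node):

```yaml
# Ballast Deployment: one tiny pod per node, spread across hostnames.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-ballast
spec:
  replicas: 2                        # roughly one warm node per replica
  selector:
    matchLabels:
      app: node-ballast
  template:
    metadata:
      labels:
        app: node-ballast
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: node-ballast
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m               # deliberately tiny so real workloads fit next to it
              memory: 16Mi
---
# PDB that blocks voluntary eviction so Karpenter won't consolidate these nodes away.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-ballast
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: node-ballast
```

A KEDA ScaledObject with a cron trigger pointed at this Deployment could then raise the replica count during office hours and drop it to zero overnight, which also covers the schedule-based GitLab Runner use case above.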
Tell us about your request: What do you want us to build?
I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.
Use Cases:
Community Note