ellistarn opened this issue 2 years ago (Open)
We must have a way to run a minimum number of nodes in a nodepool.
You can already do that (run low-priority placeholder Pods), but AFAIK there's no controller that does exactly this. Maybe I'll put some time in and try to write one.
Same case here: we need a minimum set of nodes, evenly spread among the AZs (AWS). We always want extra capacity available for workload peaks, and most of the time we can't wait for the spin-up/down dance.
@sftim not sure but maybe the cluster autoscaler provides something similar? About the low-prio placeholder Pods, have you seen any good guide to do it? Sounds like a hack tho.
Cluster Autoscaler documents how to overprovision cluster to offset node provisioning by running preemptable pods. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler
At the same time, this solution isn't exclusive to the Cluster Autoscaler; it works just fine with Karpenter and any other potential autoscaler. I wouldn't consider it a hack, as it's implemented via stable Kubernetes resources using common practices.
There is a ready-to-use Helm chart at https://github.com/deliveryhero/helm-charts/tree/master/stable/cluster-overprovisioner.
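For anyone who wants a concrete starting point without the Helm chart, a minimal sketch of that pattern could look like this (the priority value, replica count, and resource requests are placeholders you would tune to the headroom you want):

```yaml
# A low-priority PriorityClass so the placeholder pods are preempted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods that reserve headroom capacity"
---
# Placeholder pods that do nothing but hold resources on warm nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                       # number of headroom "slots" to keep warm
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"              # size each placeholder to the per-replica headroom you want
              memory: 1Gi
```

When a real pod with default (higher) priority arrives, the scheduler preempts a placeholder, the workload starts immediately, and the displaced placeholder goes pending, which prompts the autoscaler to bring up a replacement node.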
Cluster Autoscaler now also supports ProvisioningRequest CRD
There is an additional financial impact caused by the de-facto solution of keeping a warm pool of capacity through overprovisioning. When high-priority pods are scheduled and preempt the overprovisioning workloads, Karpenter will immediately scale the node group in order to regain capacity and reschedule those overprovisioning workloads. In some cases it may be preferable for the headroom to be elastic, such that the desired headroom is only restored after a deployment or batch job completes. This could be accomplished by setting a desired range of overprovisioned resources. But I also believe @sftim's minimumPodPriority suggestion is an acceptable solution.
Cluster Autoscaler now also supports ProvisioningRequest CRD
Does Karpenter have a plan to implement this? It would be really helpful for AI workloads.
+1
Cross ref-ing the ProvisioningRequest ask here: https://github.com/kubernetes-sigs/karpenter/issues/742#issuecomment-2122005473
cc @raywainman who is tracking warm replicas stories on behalf of WG Serving here:
https://docs.google.com/document/d/1QsN4ubjerEqo5L4bQamOFFS2lmCv5zNPis2Z8gcIITg
For everyone's context, we did a little bit of ideating and came up with an API that we were pretty happy with from the Karpenter side (see https://github.com/jonathan-innis/karpenter-headroom-poc). We're having an open discussion with the CAS folks about the differences between how we are thinking about the Headroom API and the ProvisioningRequest API; feel free to take a look and comment on the doc if you have any thoughts: https://docs.google.com/document/d/1SyqStWUt407Rcwdtv25yG6MpHdNnbfB3KPmc4zQuz1M/edit?usp=sharing
Hi, how's it going? I have been looking for a fixed-provisioning solution with Karpenter.
We need a minimum number of on-demand nodes (I'd say 2, one per AZ) and then scale the rest of the workload up/down with spot instances. I haven't found a solution yet; we tried to play with the "On-Demand/Spot Ratio Split" approach, but it didn't work and kept spreading spot instances across the whole workload.
Any workarounds or thoughts on how to solve this? We really want to fully use Karpenter for our workload.
Hi, @balmha
Based on Karpenter, we developed a feature that ensures a minimum number of non-spot replicas for each workload.
Under the hood, it's a webhook component that monitors the distribution of each workload and modifies the pods' affinity to prefer spot instances while requiring some replicas to run on-demand. You can check it out here: CloudPilot Console.
This is not a promotion, just a technical communication.
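Purely as an illustration of the general pattern described above (not their actual implementation): a mutating webhook could inject a soft spot preference into most replicas, roughly like the fragment below, and inject a hard requiredDuringSchedulingIgnoredDuringExecution term on karpenter.sh/capacity-type In ["on-demand"] into the replicas that must stay stable.

```yaml
# Hypothetical pod-spec fragment: soft preference for spot capacity on most replicas.
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
```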
We need a minimum number of on-demand nodes (I'd say 2, one per AZ) and then scale the rest of the workload up/down with spot instances. I haven't found a solution yet; we tried to play with the "On-Demand/Spot Ratio Split" approach, but it didn't work and kept spreading spot instances across the whole workload.
You don't need this issue resolved for your use case @balmha.
Use two NodePools:
- A NodePool with the requirement karpenter.sh/capacity-type set to ["on-demand"] and a higher weight. Set limits on it to control how much on-demand "base capacity" you need.
- A NodePool with the requirement karpenter.sh/capacity-type set to ["spot", "on-demand"] (it will schedule spot nodes and only fall back to on-demand in case of availability issues) and a lower weight.

Even better, you can let the workloads that can't tolerate interruptions set nodeSelectors or affinity so they run on on-demand nodes. That's something that is cleaner to do with Karpenter than with CA. A rough sketch of both NodePools follows below.
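In case it helps, here is a rough sketch of that two-NodePool setup. The names, the CPU limit, and the EC2NodeClass reference are placeholders, and the exact schema depends on your Karpenter version (this sketch is written against the v1 API):

```yaml
# Base capacity: on-demand only, preferred via higher weight, capped via limits.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-base
spec:
  weight: 100                      # considered before lower-weight NodePools
  limits:
    cpu: "16"                      # caps how much on-demand "base capacity" can exist
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # assumes the AWS provider and an EC2NodeClass named "default"
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Everything else: spot first, falling back to on-demand on availability issues.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-general
spec:
  weight: 10                       # used once the base pool has reached its limits
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```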
@jonathan-innis I really like the look of the headroom APIs, I think they'd cover the requirements I was talking about in #993. Is there a separate issue to track the headroom APIs? Google docs are blocked from our corporate machines.
Is this still planned to be implemented? Any kind of ETA?
Our use case would also greatly benefit from it: we scale up GitLab Runners for our organization. However, the cold starts (60-90 seconds) are not ideal for CI/CD. Having 1-2 nodes always available as headroom would make sure that every incoming pipeline starts immediately. Ideally, the minimum number of nodes could be set for a specific schedule (e.g. office hours only) so that during nights or weekends it ramps down to zero to minimize costs. (Or is there any other approach I can use to fulfill my use case with e.g. EKS + Karpenter?)
Our team uses blue/green deployments and always has to wait a few minutes for new nodes to come up, after which Karpenter disrupts nodes again (right after a deployment there is a temporary overuse of resources). It would be great if we could configure a CPU or memory buffer. For example, if we constantly use 1000m CPU, a 20% buffer would keep 1200m available for deployments and running cron jobs.
One suggestion for always having a fixed set of nodes available is a Deployment with proper topologySpreadConstraints + a PDB to force one pod (and therefore one node) per replica; set very low resource requests on it.
Don't set a lower priority, so it won't get preempted (unlike the pre-warm/headroom scenario mentioned before).
If you combine this with KEDA you can even do that dynamically based on external factors, e.g. a cron schedule. A rough sketch follows below.
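Something like the following, with placeholder names and sizes (the pause containers exist only to hold one node per replica; tune replicas to the number of warm nodes you want, and note that a required pod anti-affinity on kubernetes.io/hostname is an alternative way to strictly enforce one pod per node):

```yaml
# Ballast Deployment: one tiny pod per node, spread across hostnames.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-ballast
spec:
  replicas: 2                        # roughly one warm node per replica
  selector:
    matchLabels:
      app: node-ballast
  template:
    metadata:
      labels:
        app: node-ballast
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: node-ballast
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m               # deliberately tiny so real workloads fit next to it
              memory: 16Mi
---
# PDB that blocks voluntary eviction so Karpenter won't consolidate these nodes away.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-ballast
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: node-ballast
```

A KEDA ScaledObject with a cron trigger pointed at this Deployment could then raise the replica count during office hours and drop it to zero overnight, which also covers the schedule-based GitLab Runner use case above.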
Tell us about your request: What do you want us to build?
I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.
Use Cases:
Community Note