kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
1.37k stars 248 forks

Support automatic quota detection in ClusterQueue based on available nodes #3183

Open andrewsykim opened 2 weeks ago

andrewsykim commented 2 weeks ago

What would you like to be added:

I would like to create a ClusterQueue resource that automatically contains quotas based on available node capacity in my Kubernetes cluster. I have configured my Kubernetes cluster with autoscaling and specify max nodes.

Why is this needed:

Calculating and adjusting quotas manually for a ClusterQueue is toilsome. Often the total capacity of a cluster is managed in the form of a max node count in a node pool. It would be great if a ClusterQueue could automatically detect the total available capacity and dynamically set quotas for resources.
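For illustration, this is roughly what has to be computed and kept in sync by hand today (a sketch; the flavor name and quantities are made up):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 360      # e.g. 10 nodes x 36 vCPU, maintained by hand
      - name: memory
        nominalQuota: 1440Gi   # must be recomputed whenever the node pool changes
```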

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

andrewsykim commented 2 weeks ago

This would only work for clusters with a single ClusterQueue, but I think it would still be useful.

An alternative approach is allowing quotas to specify percentages instead of strict resource quantities.

mimowo commented 2 weeks ago

Thank you for opening the discussion; this is definitely on our radar.

For autoscaled environments, Kueue would need to learn about the max-nodes configuration and understand the node resources to automatically adjust ClusterQueue quotas. I'm not sure we have a readily available API (like CA CRDs) to read this information from, so it may require preparatory work in CA. It requires some exploration.

For non-autoscaling environments, we're working on Topology-Aware Scheduling. Part of this feature involves scraping node capacities, effectively limiting the quota based on the currently available nodes. We are not planning to support CA in the first iteration of TAS, but may revisit in the future iterations.

Expressing quotas as percentages within a cohort sounds useful to reduce the manual toil, and could be done as an independent feature. This concept is similar to the P&F (Priority and Fairness) configuration, with parameters like lendablePercent and borrowingLimitPercent.
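To make the percentage idea concrete, a purely hypothetical sketch of what it might look like (note: `nominalQuotaPercent` is not an existing Kueue field, just an illustration of the concept):

```yaml
# Hypothetical API sketch -- nominalQuotaPercent does NOT exist in Kueue today.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: main
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuotaPercent: 30   # 30% of the cohort's total capacity
```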

/cc @mwielgus

andrewsykim commented 2 weeks ago

Thanks for the reply!

Kueue would need to learn about the max-nodes configuration and understand the node resources to automatically adjust ClusterQueue quotas.

Do we need to read the max-nodes configuration from CA? Could we instead just watch for new nodes and dynamically adjust the quotas? I guess the challenge with either approach is that there will be Pods on every node (DaemonSets) that won't necessarily consume quota from the ClusterQueue; we would need user input to know how much of each node's resources can be allocated to the quota. This would not be that different from supporting quotas as percentages.
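The arithmetic being discussed can be sketched as follows (illustrative numbers only; a real controller would watch the Node and DaemonSet APIs, and the per-node overhead would come from user input or from scraping DaemonSet requests):

```python
# Sketch: derive a ClusterQueue CPU quota from observed node capacity,
# subtracting a per-node DaemonSet overhead.

def cluster_queue_cpu_quota(node_allocatable_cpus, daemonset_cpu_per_node):
    """Total CPU available to the ClusterQueue across all current nodes."""
    return sum(
        max(alloc - daemonset_cpu_per_node, 0.0)
        for alloc in node_allocatable_cpus
    )

# Three nodes with 36 allocatable vCPUs each; DaemonSets reserve 2 vCPUs per node.
quota = cluster_queue_cpu_quota([36.0, 36.0, 36.0], daemonset_cpu_per_node=2.0)
print(quota)  # 102.0
```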

mimowo commented 2 weeks ago

Could we instead just watch for new nodes and dynamically adjust the quotas?

This is pretty much the approach we take in Topology Aware Scheduling (TAS), but in autoscaling environments you don't have the nodes until the workload is admitted (so kind of a chicken and egg problem?).

I guess the challenge with either approach is that there will be Pods on every node (DaemonSets) that won't necessarily consume quota from the ClusterQueue; we would need user input to know how much of each node's resources can be allocated to the quota.

Yeah, for TAS we plan to scrape the information about the Pod usage from DaemonSets. Either by watching DaemonSets or Pods directly.

andrewsykim commented 2 weeks ago

Thanks @mimowo, great to hear you're already thinking about this

mimowo commented 1 week ago

An alternative approach is allowing quotas to specify percentages instead of strict resource quantites

For this idea, you can achieve something very close with fair sharing. You can basically have a cohort and assign fair sharing weights to the ClusterQueues within the cohort. The weights denote priorities and translate to "percentages". For example, if you have 3 CQs and you want them to share load in roughly 10%, 20%, 70% proportions, you can assign the fair sharing weights as: 10, 20, 70. One reason these are weights rather than percentages is that it allows mutating the values and the set of CQs without having to keep the sum at 100.
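The weights described above might look like this (a sketch assuming Kueue's `fairSharing.weight` field; names and proportions are illustrative):

```yaml
# Three ClusterQueues in one cohort sharing capacity roughly 10% / 20% / 70%.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-a
spec:
  cohort: main
  fairSharing:
    weight: 10
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-b
spec:
  cohort: main
  fairSharing:
    weight: 20
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-c
spec:
  cohort: main
  fairSharing:
    weight: 70
```

Because only the ratio between weights matters, adding a fourth CQ or changing one weight never requires rebalancing the others to sum to 100.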

tenzen-y commented 1 week ago

As an alternative, I'm wondering if we can obtain the simulation result from the CA. AFAIK, the CA has a Pod scheduling simulation mechanism, and it resizes the cluster based on the simulation result, right?

So, I'm curious if we can obtain the Pod Scheduling simulation result and auto-adjust the CQ configuration.

But that solution may require a massive development cost, I guess...