kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Topology Aware Scheduling #2724

Open mimowo opened 1 month ago

mimowo commented 1 month ago

What would you like to be added:

Ability to control how closely the pods are packed on nodes in a data center.

Currently, a user of Kueue, such as an AI/ML researcher, has no way of saying "run this workload so that all pods are on nodes within a rack (or block)". Running a workload with Pods scattered across a data center results in longer runtimes, and thus higher costs.

Why is this needed:

To reduce the costs of running AI/ML workloads which require exchanging huge amounts of data over the network.

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

mimowo commented 1 month ago

/assign

mimowo commented 1 month ago

/cc @mwielgus

tenzen-y commented 1 month ago

@mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.

mimowo commented 1 month ago

> @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.

Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?

Maybe this is related to your question (I'm not sure 100%), but a RF can have a set of labels which have nothing to do with topology. For example, they can be to choose a GPU family.

tenzen-y commented 1 month ago

> > @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.

> Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?

> Maybe this is related to your question (I'm not sure 100%), but a RF can have a set of labels which have nothing to do with topology. For example, they can be to choose a GPU family.

Let me check what "GPU family" means. Which K8s feature can represent the GPU family? Node labels, node taints, or some other feature?

mimowo commented 1 month ago

> Let me check what "GPU family" means. Which K8s feature can represent the GPU family? Node labels, node taints, or some other feature?

This was just an example; what I meant is that nodes have labels. Some labels correspond to topology (the new ones, for example cloud-provider.com/topology-block or cloud-provider.com/topology-rack), and some don't (like cloud.google.com/machine-family).
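To make the distinction concrete, here is a minimal sketch (not Kueue's implementation). It assumes the two topology label keys named above; the node names, label values, free-slot counts, and helper functions are all hypothetical, and it assumes rack names are unique across blocks. It sums free capacity per topology domain and ignores labels that are not part of the topology, such as the machine family.

```python
from collections import defaultdict

# Label keys taken from the comment above; everything else is made up.
BLOCK_LABEL = "cloud-provider.com/topology-block"
RACK_LABEL = "cloud-provider.com/topology-rack"
FAMILY_LABEL = "cloud.google.com/machine-family"  # not a topology label

# Hypothetical nodes: name -> (labels, free pod slots).
NODES = {
    "node-a": ({BLOCK_LABEL: "b1", RACK_LABEL: "r1", FAMILY_LABEL: "n2"}, 2),
    "node-b": ({BLOCK_LABEL: "b1", RACK_LABEL: "r1", FAMILY_LABEL: "n2"}, 2),
    "node-c": ({BLOCK_LABEL: "b1", RACK_LABEL: "r2", FAMILY_LABEL: "n2"}, 3),
    "node-d": ({BLOCK_LABEL: "b2", RACK_LABEL: "r3", FAMILY_LABEL: "n2"}, 1),
}


def capacity_by_domain(nodes, label_key):
    """Sum free pod slots per value of the given topology label."""
    totals = defaultdict(int)
    for labels, free_slots in nodes.values():
        if label_key in labels:  # nodes without the label are skipped
            totals[labels[label_key]] += free_slots
    return dict(totals)


def domains_that_fit(nodes, label_key, pod_count):
    """Return the topology domains that can host every pod of a workload."""
    capacity = capacity_by_domain(nodes, label_key)
    return sorted(d for d, free in capacity.items() if free >= pod_count)
```

With this sample data, a 4-pod workload fits entirely within rack r1, while a 5-pod workload fits in no single rack and can only be packed at the block level (block b1).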

This may be clearer from the example table in: https://github.com/kubernetes-sigs/kueue/blob/5d7847bed87ffa353732164de229b0f94aeab8bd/keps/2724-topology-aware-schedling/README.md#hierarchy-representation.

I think two things are important for design choice:

I think we can discuss specific details of the API or alternatives in the KEP.

KPostOffice commented 2 weeks ago

@tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there are limits on the number of flavors that can be defined by a ClusterQueue, currently around 8 or so. @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

tenzen-y commented 1 week ago

> @tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there are limits on the number of flavors that can be defined by a ClusterQueue, currently around 8 or so. @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

@KPostOffice Thank you for catching up and giving me your feedback. I added a similar concern here: https://github.com/kubernetes-sigs/kueue/pull/2725#discussion_r1754907510

Let's discuss that in the KEP PR.