Topology Aware Scheduling

mimowo commented 1 month ago

What would you like to be added:

Ability to control how closely the pods are packed on nodes in a data center.

Currently, a Kueue user, such as an AI/ML researcher, has no way of saying "run this workload so that all pods are on nodes within a rack (or block)". Running a workload with Pods scattered across a data center results in longer runtimes, and thus higher costs.

Why is this needed:

To reduce the cost of running AI/ML workloads, which require exchanging huge amounts of data over the network.

Completion requirements:

This enhancement requires the following artifacts:

[ ] Design doc
[ ] API change
[ ] Docs update

The artifacts should be linked in subsequent comments.

mimowo commented 1 month ago

/assign

mimowo commented 1 month ago

/cc @mwielgus

tenzen-y commented 1 month ago

@mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.

mimowo commented 1 month ago

> @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.

Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?

Maybe this is related to your question (I'm not sure 100%), but a RF can have a set of labels which have nothing to do with topology. For example, they can be to choose a GPU family.

tenzen-y commented 1 month ago

> > @mimowo What is the reason that you do not prefer ResourceFlavor taints instead of dedicated fields? If I am missing any context, please let me know.
>
> Sure, I will be happy to explain, but I'm not sure I understand: which fields do you mean?
>
> Maybe this is related to your question (I'm not sure 100%), but a RF can have a set of labels which have nothing to do with topology. For example, they can be to choose a GPU family.

Let me check what "GPU family" means. Which K8s features can represent the GPU family? Node labels? Node taints? Or other features?

mimowo commented 1 month ago

> Let me check what "GPU family" means. Which K8s features can represent the GPU family? Node labels? Node taints? Or other features?

This was just an example. What I meant is that nodes have labels: some correspond to topology (the new ones, for example cloud-provider.com/topology-block or cloud-provider.com/topology-rack), and some don't (like cloud.google.com/machine-family).

It may be clearer from the example table in: https://github.com/kubernetes-sigs/kueue/blob/5d7847bed87ffa353732164de229b0f94aeab8bd/keps/2724-topology-aware-schedling/README.md#hierarchy-representation.
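To illustrate the distinction between topology and non-topology labels, here is a minimal sketch that groups nodes into a block/rack hierarchy using only the two topology label keys mentioned above. The node names, label values, and the `group_by_topology` helper are hypothetical and are not part of Kueue's API.

```python
from collections import defaultdict

# Label keys taken from the examples in this thread; values below are made up.
BLOCK_LABEL = "cloud-provider.com/topology-block"
RACK_LABEL = "cloud-provider.com/topology-rack"


def group_by_topology(nodes):
    """Return {block: {rack: [node names]}} built from the topology labels only."""
    tree = defaultdict(lambda: defaultdict(list))
    for name, labels in nodes.items():
        # Non-topology labels (e.g. the machine family) are simply ignored here.
        tree[labels[BLOCK_LABEL]][labels[RACK_LABEL]].append(name)
    return {
        block: {rack: sorted(names) for rack, names in racks.items()}
        for block, racks in tree.items()
    }


# Hypothetical nodes: each carries two topology labels and one unrelated label.
nodes = {
    "node-a": {BLOCK_LABEL: "block1", RACK_LABEL: "rack1",
               "cloud.google.com/machine-family": "n2"},
    "node-b": {BLOCK_LABEL: "block1", RACK_LABEL: "rack1",
               "cloud.google.com/machine-family": "n2"},
    "node-c": {BLOCK_LABEL: "block1", RACK_LABEL: "rack2",
               "cloud.google.com/machine-family": "n2"},
    "node-d": {BLOCK_LABEL: "block2", RACK_LABEL: "rack3",
               "cloud.google.com/machine-family": "n2"},
}

hierarchy = group_by_topology(nodes)
```

All four nodes share the same machine family, so that label cannot distinguish them, while the topology labels place them in three different racks across two blocks.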

I think two things are important for the design choice:

- It is not feasible for an admin to create one RF per rack and match it using the existing API if you have thousands of racks in a cluster.
- Some workloads may not fit within a single rack. Still, we want Kueue to compact the placement of pods so that the number of used racks is minimal. So, some pods will have the label cloud-provider.com/topology-rack: rack1 while others will have cloud-provider.com/topology-rack: rack2. This is not expressible with the current API.
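As a rough illustration of the second point, the sketch below spreads a workload over as few racks as possible by filling the racks with the most free capacity first. It assumes identical pods and a known number of free pod slots per rack; the `pick_racks` helper and the numbers are hypothetical, and this is not the algorithm proposed in the KEP.

```python
def pick_racks(free_slots, pods_needed):
    """Greedily assign pods to the racks with the most free slots.

    Returns {rack: number of pods placed there}, or None if the
    workload does not fit in the cluster at all.
    """
    assignment = {}
    remaining = pods_needed
    # Largest racks first; ties are broken by rack name for determinism.
    for rack, free in sorted(free_slots.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        if free <= 0:
            continue
        take = min(free, remaining)
        assignment[rack] = take
        remaining -= take
    return assignment if remaining == 0 else None


# Hypothetical free capacity, in pods, for three racks.
free_slots = {"rack1": 4, "rack2": 3, "rack3": 1}

# Six pods do not fit in any single rack, so two racks are used.
placement = pick_racks(free_slots, 6)
```

Here four pods would carry the rack1 label value and two the rack2 value, which is the mixed placement that a single ResourceFlavor node label cannot describe.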

I think we can discuss specific details of the API or alternatives in the KEP.

KPostOffice commented 2 weeks ago

@tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there is currently a limit of around 8 or so on the number of flavors that a ClusterQueue can define, and @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

tenzen-y commented 1 week ago

> @tenzen-y, how quickly will this slam the queuing algorithm if each rack needs to be treated as a different flavor? I know there is currently a limit of around 8 or so on the number of flavors that a ClusterQueue can define, and @mimowo mentioned thousands of racks. I get the feeling that this should be handled at the scheduler level, not at the queuing level.

@KPostOffice Thank you for catching up and giving me your feedback. I added a similar concern here: https://github.com/kubernetes-sigs/kueue/pull/2725#discussion_r1754907510

Let's discuss that in the KEP PR.

kubernetes-sigs / kueue

Topology Aware Scheduling #2724