kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

TAS: support rank-ordering for Pods #3533

Open mimowo opened 1 week ago

mimowo commented 1 week ago

What would you like to be added:

For Jobs that provide indexing (like batch/Job), Pods with consecutive indexes (ranks) should be placed as close together as possible in the topology tree.

The current implementation places pods pretty much randomly (as they show up in the API server).

Example: we have a Job with 10 Pods: 0,1,2,3,4,5,6,7,8,9. We have 3 racks, each with 4 slots.
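One possible reading of this example, sketched in Python for illustration only: fill the racks in order with consecutive ranks. The function name `assign_ranks_to_racks` and the greedy strategy are assumptions made for this sketch, not Kueue's actual TAS algorithm.

```python
def assign_ranks_to_racks(num_pods, rack_capacities):
    """Greedy rank-ordered placement (hypothetical helper).

    Walks the racks in order and gives each one the next block of
    consecutive ranks, up to its capacity.
    """
    assignment = []
    next_rank = 0
    for capacity in rack_capacities:
        take = min(capacity, num_pods - next_rank)
        assignment.append(list(range(next_rank, next_rank + take)))
        next_rank += take
    if next_rank < num_pods:
        raise ValueError("not enough slots for all pods")
    return assignment


# 10 pods, 3 racks with 4 slots each, as in the example above.
placement = assign_ranks_to_racks(10, [4, 4, 4])
print(placement)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With this placement, ranks 0-3 share a rack, ranks 4-7 share a rack, and ranks 8-9 share the third rack.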

Why is this needed:

For improved performance of network communication between pods. This is especially important for AI/ML frameworks, where the pods exchange data in a ring structure (like in NCCL).
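To make the ring argument concrete, here is a rough Python sketch that counts how many neighbouring-rank links of a ring (rank i talks to rank i+1, wrapping around) cross a rack boundary. The helper `cross_rack_ring_links` and the two placements are illustrative assumptions, not measurements from Kueue or NCCL.

```python
def cross_rack_ring_links(rack_of_rank):
    """Count ring links (rank i -> rank i+1, wrapping) that span two racks.

    rack_of_rank[i] is the rack index hosting the pod with rank i.
    """
    n = len(rack_of_rank)
    return sum(
        1 for r in range(n) if rack_of_rank[r] != rack_of_rank[(r + 1) % n]
    )


# 10 pods on 3 racks with 4 slots each.
contiguous = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]   # consecutive ranks kept together
interleaved = [r % 3 for r in range(10)]      # ranks scattered across racks

print(cross_rack_ring_links(contiguous))   # 3
print(cross_rack_ring_links(interleaved))  # 9
```

In this toy setup, rank-ordered placement leaves 3 of the 10 ring links crossing racks, while the scattered placement leaves 9.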

It is part of https://github.com/kubernetes-sigs/kueue/issues/3450

mimowo commented 1 week ago

/assign

mimowo commented 1 week ago

cc @PBundyra @mwysokin @mwielgus @tenzen-y

mimowo commented 6 days ago

An extension of this that we hear would be useful for our users is support for PodGroups that are managed and indexed by an external controller.

I think that the current mechanism (based on label lookup in TopologyUngater) creates an incentive to support this by adding k8s-reserved labels to the custom controllers, which is not healthy. To resolve this, I propose an API at the workload level in PodSetTopologyRequest, called podIndexLabel (or 2 more fields to also abstract JobSet: jobIndexLabel and replicatedJobCount). With this API, the implementation of the GenericJob interface will set the values depending on the framework. For pod groups we could have a label reserved by Kueue, like kueue.x-k8s.io/pod-group-index (or similar).
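A minimal sketch of how such fields might be consumed, written in Python for brevity (Kueue itself is written in Go). The class below only mirrors the field names proposed above; the `pod_rank` helper, its signature, and the JobSet rank formula (job index times pods per job, plus pod index) are assumptions made for illustration, not an agreed design.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PodSetTopologyRequest:
    # Hypothetical Python mirror of the proposed API fields.
    pod_index_label: Optional[str] = None
    job_index_label: Optional[str] = None
    replicated_job_count: Optional[int] = None


def _read_index(labels: Mapping[str, str], key: Optional[str]) -> Optional[int]:
    """Return the non-negative integer stored under `key`, or None."""
    if key is None or key not in labels:
        return None
    try:
        value = int(labels[key])
    except ValueError:
        return None
    return value if value >= 0 else None


def pod_rank(labels, request, pod_set_size):
    """Derive a pod's rank from the labels named in the request (assumed logic)."""
    pod_index = _read_index(labels, request.pod_index_label)
    if pod_index is None:
        return None  # no usable index: caller falls back to unordered placement
    if request.job_index_label is None:
        return pod_index
    job_index = _read_index(labels, request.job_index_label)
    if job_index is None or not request.replicated_job_count:
        return None
    # Assumption: every replicated job holds the same number of pods.
    pods_per_job = pod_set_size // request.replicated_job_count
    return job_index * pods_per_job + pod_index


# A framework integration would pick the label; here batch/Job's index label.
req = PodSetTopologyRequest(pod_index_label="batch.kubernetes.io/job-completion-index")
print(pod_rank({"batch.kubernetes.io/job-completion-index": "7"}, req, 10))  # 7
```

The point of the design is that the ungater reads whichever label the request names, so an external controller could use kueue.x-k8s.io/pod-group-index without borrowing k8s-reserved labels.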

We need a TAS KEP extension for that, and @PBundyra agreed tentatively to work on it.

cc @tenzen-y @mwysokin @mwielgus

mimowo commented 6 days ago

/assign @PBundyra for the pod groups support and the generalized API

mimowo commented 6 days ago

/assign @mbobrovskyi for the Kubeflow indexes support. For now, just look up the pod index label in the TopologyUngater; it will later be generalized by the new API. This also includes an e2e test for Kubeflow indexing.