For Jobs which provide indexing (like batch/Job), Pods with consecutive indexes (ranks) should be placed as close to each other as possible in the topology tree.
The current implementation places Pods in an essentially arbitrary order (the order in which they show up in the API server).
Example: we have a Job with 10 Pods: 0,1,2,3,4,5,6,7,8,9. We have 3 racks, each with 4 slots.
Current possible ordering: [1,4,5,7][0,3,8,9][6,2]. This is suboptimal because the links 0-1, 1-2, 2-3, 3-4, 5-6, 6-7 and 7-8 cross a rack boundary and so will be slow.
Wanted: [0,1,2,3][4,5,6,7][8,9]. This is optimal: only the links 3-4, 7-8 and 9-0 (the link that closes the ring) cross a rack boundary.
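The comparison above can be checked with a small sketch. This is only an illustration of the idea, not the scheduler's implementation: the function names `pack_by_rank` and `cross_rack_ring_edges` are hypothetical, racks are modeled as plain lists of Pod indexes, and it assumes that Pods communicate in a ring (i to i+1, with the last Pod wrapping around to 0) and that enough racks are available.

```python
from typing import List, Tuple


def pack_by_rank(num_pods: int, rack_capacity: int) -> List[List[int]]:
    """Fill racks one after another with consecutive Pod indexes.

    Hypothetical helper: assumes every rack has the same number of
    free slots and that there are enough racks for all Pods.
    """
    return [
        list(range(start, min(start + rack_capacity, num_pods)))
        for start in range(0, num_pods, rack_capacity)
    ]


def cross_rack_ring_edges(racks: List[List[int]]) -> List[Tuple[int, int]]:
    """Return the ring links (i, i+1 mod n) whose endpoints sit in different racks."""
    # Map each Pod index to the position of the rack that holds it.
    rack_of = {pod: rack for rack, pods in enumerate(racks) for pod in pods}
    n = len(rack_of)
    return [
        (i, (i + 1) % n)
        for i in range(n)
        if rack_of[i] != rack_of[(i + 1) % n]
    ]


# Placement from the "current possible ordering" example.
arbitrary = [[1, 4, 5, 7], [0, 3, 8, 9], [6, 2]]
# Placement from the "wanted" example: 10 Pods, 4 slots per rack.
wanted = pack_by_rank(10, 4)

print(len(cross_rack_ring_edges(arbitrary)), cross_rack_ring_edges(arbitrary))
print(len(cross_rack_ring_edges(wanted)), cross_rack_ring_edges(wanted))
```

Under these assumptions the arbitrary placement has 7 links crossing a rack boundary, while the rank-ordered placement has 3, which is the minimum for a ring spread over 3 racks.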
Why is this needed:
To improve the performance of network communication between Pods. This is especially important for AI/ML frameworks, where the Pods exchange data in a ring structure (as in NCCL).