yxzhao6 opened this issue 2 months ago
@terrytangyuan @alculquicondor Do you have any benchmark results?
At that scale, the limitations don't come from mpi-operator, but from the network and how the pods are placed relative to the network topology.
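For example, worker placement can be steered with ordinary Kubernetes scheduling rather than anything in mpi-operator itself. Below is a minimal sketch of an MPIJob that spreads its workers across zones; the job name, image, and topology key are placeholders, and the label selector assumes the `training.kubeflow.org/job-name` label that the operator applies to worker pods:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: large-scale-training            # placeholder name
spec:
  slotsPerWorker: 8                     # one slot per GPU on an 8-GPU node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/mpi-training:latest   # placeholder image
    Worker:
      replicas: 375                     # 375 workers x 8 GPUs = 3000 GPUs
      template:
        spec:
          # Spread workers across zones/racks so collective traffic stays
          # predictable; this is standard Kubernetes scheduling, not mpi-operator.
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone   # swap for a rack-level key if available
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                training.kubeflow.org/job-name: large-scale-training
          containers:
          - name: worker
            image: registry.example.com/mpi-training:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 8
```

The same idea applies to node affinity or a topology-aware scheduler plugin; the point is that pod placement, not the controller, is what you tune at this scale.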
Do you have more details?
Agreed. I don't think there's anything on the controller side that blocks scaling. I don't have any public benchmark results, though.
We are building supercomputing infrastructure for an internal GPU cluster with up to thousands of expensive GPUs, and we are deciding between adopting mpi-operator and Slurm.
Slurm is widely adopted in large-scale HPC, so its scalability is well tested.
Are there any benchmark results for mpi-operator on a cluster with more than 3000 GPUs?