kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

What scale can mpi-operator support? #648

Open yxzhao6 opened 2 months ago

yxzhao6 commented 2 months ago

We are building supercomputing infrastructure for an internal GPU cluster that will scale up to thousands of expensive GPUs, and we are deciding whether to adopt mpi-operator or Slurm.

Slurm is widely adopted in large-scale HPC computing, so its scalability is well tested.

Are there any benchmark results for mpi-operator on clusters with more than 3,000 GPUs?

tenzen-y commented 2 months ago

@terrytangyuan @alculquicondor Do you have any benchmark results?

alculquicondor commented 2 months ago

At that scale, the limitations don't come from mpi-operator itself, but from the network and how the pods land on it. For perspective, 3,000 GPUs at 8 GPUs per node is only about 375 worker pods, which is a modest pod count for a Kubernetes cluster; the harder problem is keeping those pods topologically close on the fabric.
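For illustration, here is a minimal MPIJob sketch that packs all workers onto a single topology domain so MPI traffic stays on one fabric. The job name, image, node label value, and replica counts are placeholders, not mpi-operator defaults, and a real cluster would typically use a rack- or fabric-level node label rather than a zone:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: big-training                # placeholder name
spec:
  slotsPerWorker: 8                 # one MPI slot per GPU on an 8-GPU node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/train:latest   # placeholder image
            command: ["mpirun", "python", "/opt/train.py"]
    Worker:
      replicas: 375                 # 375 nodes x 8 GPUs = 3,000 GPUs
      template:
        spec:
          # Pin all workers to one topology domain so MPI traffic stays
          # on the same fabric. The label key/value are placeholders; use
          # whatever rack/fabric labels your nodes actually carry.
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["zone-a"]
          containers:
          - name: worker
            image: registry.example.com/train:latest
            resources:
              limits:
                nvidia.com/gpu: 8   # request all GPUs on the node
```

Note that everything scale-sensitive in a spec like this (the affinity term, the GPU packing) is plain Kubernetes scheduling rather than operator logic.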

Do you have more details?

terrytangyuan commented 2 months ago

Agreed. I don't think there's anything on the controller side that blocks that scale, but I don't have any public benchmarks to share.