kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Multiple MPI jobs via multiple launchers? #574

Closed. AymenFJA closed this issue 1 year ago.

AymenFJA commented 1 year ago

Dear all,

Is it possible to start multiple Launchers via the replicas option, as a way to run multiple mpirun invocations at once within a single deployment?

Submitting multiple deployments leaves some of the Pods stuck in Pending. For example, suppose MPIJob-1 and MPIJob-2 (1 launcher and 2 workers each) are submitted at the same time. The MPIJob-1 launcher may end up running while one of its workers stays pending forever, or vice versa, because there are not enough resources left for every Pod of the job to start. Here is an actual example:

kubectl apply -f mpi_concat.yaml -f mpi_join.yaml
mpijob.kubeflow.org/concat created
mpijob.kubeflow.org/join created

kubectl get pods -w
NAME                    READY   STATUS    RESTARTS   AGE
concat-launcher-gbrtf   0/1     Pending   0          5s
concat-worker-0         1/1     Running   0          6s
concat-worker-1         1/1     Running   0          6s
concat-worker-2         0/1     Pending   0          6s
join-launcher-86zf6     0/1     Pending   0          6s
join-worker-0           1/1     Running   0          6s
join-worker-1           1/1     Running   0          6s
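
The listing above can be reproduced with a toy model, shown here as a minimal Python sketch rather than anything the scheduler actually runs. It assumes every Pod needs exactly one slot, that the cluster has 4 free slots, and that Pods from both jobs arrive interleaved with workers first; the function name schedule_per_pod and all of these numbers are hypothetical and chosen only to match the output.

```python
from itertools import zip_longest

def schedule_per_pod(jobs, capacity):
    """Place pods one at a time, ignoring which job each pod belongs to."""
    placed = {name: [] for name in jobs}
    per_job = [[(name, pod) for pod in pods] for name, pods in jobs.items()]
    # Interleave pods from all jobs, as if the jobs were submitted together.
    for batch in zip_longest(*per_job):
        for item in batch:
            if item is None or capacity == 0:
                continue  # no pod left for this job, or the cluster is full
            name, pod = item
            placed[name].append(pod)
            capacity -= 1
    return placed

# Assumed creation order: workers first, then the launcher.
jobs = {
    "concat": ["worker-0", "worker-1", "worker-2", "launcher"],
    "join": ["worker-0", "worker-1", "launcher"],
}
placed = schedule_per_pod(jobs, capacity=4)
# A job can only make progress when every one of its pods is placed.
runnable = [name for name, pods in jobs.items() if len(placed[name]) == len(pods)]
print(placed)    # which pods got a slot
print(runnable)  # jobs that can actually run
```

Under these assumptions both jobs hold two slots each and neither is complete, so no job can run even though the cluster is fully occupied.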

Am I missing something, or is my understanding of the Kubeflow mpi-operator wrong and this is simply not possible? Also, is there an alternative way to have multiple MPIJobs coexist at the same time in a coordinated manner?

alculquicondor commented 1 year ago

This is completely outside of the control of the mpi-operator.

You need to add a job queueing system (like https://kueue.sigs.k8s.io/docs/tasks/run_mpi_jobs/) or a gang scheduler.

tenzen-y commented 1 year ago

NOTE: Kueue doesn't guarantee that all Pods of a job are scheduled onto Nodes at the same time (gang scheduling). So I would suggest combining job queueing by Kueue, with sequential admission, and gang scheduling by scheduler-plugins.

In my company's production environment, they work fine together :)
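
As a rough illustration of why all-or-nothing admission avoids the stuck Pods, here is a minimal Python sketch. It is a toy model, not how Kueue or scheduler-plugins are implemented: the function name admit_all_or_nothing, the one-slot-per-Pod sizing, the 4-slot capacity, and the assumption that running jobs finish before the next admission step are all hypothetical.

```python
def admit_all_or_nothing(jobs, capacity):
    """Start jobs in submission order; a job starts only if all its pods fit."""
    pending = list(jobs.items())
    running = {}  # job name -> number of slots it currently holds
    waves = []    # names of the jobs started in each admission step
    while pending:
        started = []
        for name, pods in list(pending):
            if len(pods) <= capacity:
                capacity -= len(pods)
                running[name] = len(pods)
                pending.remove((name, pods))
                started.append(name)
        if not started:
            # Nothing is running here, so the job can never fit.
            raise RuntimeError("a job is larger than the whole cluster")
        waves.append(started)
        # Toy assumption: every running job finishes before the next step.
        capacity += sum(running.values())
        running.clear()
    return waves

jobs = {
    "concat": ["worker-0", "worker-1", "worker-2", "launcher"],
    "join": ["worker-0", "worker-1", "launcher"],
}
waves = admit_all_or_nothing(jobs, capacity=4)
print(waves)
```

In this model a job that does not fit holds no slots at all while it waits, so the jobs run one after another instead of blocking each other with partially placed Pods.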

AymenFJA commented 1 year ago

@alculquicondor @tenzen-y, thank you so much. I really appreciate it, and things are much clearer now.

AymenFJA commented 1 year ago

@tenzen-y, sorry, I should have asked before closing this issue. Could you please share some initial steps for the approach you mentioned? I am struggling to find a step-by-step tutorial for reproducing that setup. I really appreciate it.

tenzen-y commented 1 year ago

@AymenFJA You can refer to the following documents:

AymenFJA commented 1 year ago

Thanks, @tenzen-y, for sharing.