Multiple MPI jobs via multiple launchers?

AymenFJA commented 1 year ago

Dear all,

Is it possible to start multiple Launchers via the replicas option, as a way to run multiple mpirun invocations at once within a single deployment?

Submitting multiple deployments leaves some of the Pods stuck in Pending. For example, MPIJob-1 and MPIJob-2 (1 launcher and 2 workers each) are submitted at the same time. MPIJob-1's launcher may then be running while one of its workers stays Pending forever, or vice versa, because the cluster does not have enough resources left to start the remaining Pods. Here is an actual example:

kubectl apply -f mpi_concat.yaml -f mpi_join.yaml
mpijob.kubeflow.org/concat created
mpijob.kubeflow.org/join created

kubectl get pods -w
NAME                    READY   STATUS    RESTARTS   AGE
concat-launcher-gbrtf   0/1     Pending   0          5s
concat-worker-0         1/1     Running   0          6s
concat-worker-1         1/1     Running   0          6s
concat-worker-2         0/1     Pending   0          6s
join-launcher-86zf6     0/1     Pending   0          6s
join-worker-0           1/1     Running   0          6s
join-worker-1           1/1     Running   0          6s

Am I missing something, or is my understanding of the Kubeflow mpi-operator wrong and this is simply not possible? Also, is there an alternative way to have multiple MPIJobs coexist in a coordinated manner?

alculquicondor commented 1 year ago

This is completely outside of the control of the mpi-operator.

You need to add a job queueing system (like https://kueue.sigs.k8s.io/docs/tasks/run_mpi_jobs/) or a gang scheduler.

tenzen-y commented 1 year ago

NOTE: Kueue doesn't guarantee that all Pods are scheduled onto Nodes at the same time (gang scheduling). So I would suggest combining job queueing by Kueue (with sequential admission) and gang scheduling by scheduler-plugins.

In my company's production environment, this combination works fine :)
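
For readers who want a concrete starting point: as far as the Kueue documentation describes it, sequential admission is switched on through the `waitForPodsReady` section of Kueue's Configuration. The fragment below is a minimal sketch, not a complete or tested setup. The timeout value is an illustrative assumption, and the exact fields should be checked against the docs for the Kueue version in use.

```yaml
# Sketch of the Kueue manager Configuration (normally embedded in the
# kueue-manager-config ConfigMap). Values are assumptions for illustration.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true    # admit the next workload only after the previous one has all Pods ready
  timeout: 10m    # assumed value: how long an admitted workload may wait for its Pods
```

With this enabled, Kueue holds back the next queued job until the Pods of the previously admitted one are ready, which avoids the situation where two jobs each grab part of the cluster and neither can finish starting.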

AymenFJA commented 1 year ago

@alculquicondor @tenzen-y Thank you so much, I really appreciate it. Things are much clearer now.

AymenFJA commented 1 year ago

@tenzen-y, sorry, I should have asked before closing this issue. Could you share some initial steps for the approach you mentioned? I am struggling to find a tutorial with steps to reproduce that setup. I really appreciate it.

tenzen-y commented 1 year ago

@AymenFJA You can refer to the following documents:

Kueue with sequential admissions: https://kueue.sigs.k8s.io/docs/tasks/setup_sequential_admission/
MPIJob with Kueue: https://kueue.sigs.k8s.io/docs/tasks/run_mpi_jobs/
Coscheduling Plugin: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling
MPIJob with Coscheduling Plugin:
- https://www.kubeflow.org/docs/components/training/mpi/#scheduling-policy
- https://www.kubeflow.org/docs/components/training/job-scheduling/#scheduler-plugins-with-coscheduling
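
Putting the linked documents together, an MPIJob that opts into both Kueue queueing and gang scheduling might look roughly like the sketch below. It is an illustration under stated assumptions, not a verified manifest: the LocalQueue name `user-queue`, the image, and the `mpirun` command are hypothetical placeholders, and the worker count simply mirrors the `concat` Pods listed earlier in this thread. It also assumes the mpi-operator and the cluster scheduler have gang scheduling enabled as the Kubeflow and scheduler-plugins pages describe.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: concat
  labels:
    # Hands the job to Kueue; "user-queue" is a hypothetical LocalQueue name.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    schedulingPolicy:
      # Gang size: 1 launcher + 3 workers must all be schedulable together.
      minAvailable: 4
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/concat:latest               # hypothetical image
            command: ["mpirun", "-n", "3", "/app/concat"]  # hypothetical command
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: worker
            image: example.com/concat:latest               # hypothetical image
```

The `join` job would carry the same label and its own `minAvailable`, so that Kueue decides which job is admitted first and the coscheduling plugin only binds Pods once the whole gang fits.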

AymenFJA commented 1 year ago

Thanks, @tenzen-y, for sharing.

kubeflow / mpi-operator

Multiple MPI jobs via multiple launchers? #574