kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

When WaitForWorkersReady is enabled in the MPI operator, the MPI operator and the gang scheduler are in a deadlock #608

Open yzhao-2023 opened 6 months ago

yzhao-2023 commented 6 months ago

If WaitForWorkersReady is enabled, the MPI operator and a gang scheduler get stuck in a deadlock:

  1. With WaitForWorkersReady enabled, the MPI operator creates a pod group containing only the worker pod specs, but with the desired pod count set to N+1 (N workers + 1 launcher, where N is the worker pod count).
  2. The gang scheduler will not schedule this pod group, because not enough pods exist to satisfy the pod group's minimum.
  3. The MPI operator will not create the launcher pod, because the worker pods are not ready yet.

A workaround, albeit one that still violates gang scheduling's semantics, is to set runPolicy.minAvailable to the worker count, allowing the MPI operator to create a pod group with only the worker pods, and allowing the gang scheduler to proceed with scheduling the workers.
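As a sketch, the workaround might look like the following MPIJob fragment. This is illustrative only: the replica counts and job name are made up, and the `launcherCreationPolicy: WaitForWorkersReady` field placement is an assumption about how the feature is enabled, not something stated in this thread.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: train-job                                # hypothetical name
spec:
  launcherCreationPolicy: WaitForWorkersReady    # assumed spelling/placement of the feature flag
  runPolicy:
    schedulingPolicy:
      minAvailable: 4    # set to the worker count (N), not N+1, so the worker gang can form
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 4
```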

The problem is that the strict semantics of gang scheduling are broken, and the launcher, scheduled outside the gang, might not get scheduled.

In reality, this should not be a problem: the launcher pod does not consume GPUs, so capacity for it should be amply available in our case.

But the doc should be updated to reflect this pitfall.

A better fix might be to change the default behavior to create a pod group with only N pods (N being the worker pod count), at the risk of the launcher not being started.

A possible true fix: extend Kubernetes so that resources can be allocated without immediately starting the pods, so that the launcher can run after the workers have started.

[0] https://www.kubeflow.org/docs/components/training/mpi/#scheduling-policy
[1] https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319

alculquicondor commented 6 months ago

Does volcano offer an API to declare the size of the group beforehand?

Otherwise, there is nothing we can do in this repo.

You might also want to consider https://kueue.sig.k8s.io which doesn't face this issue because it's not pod-based.

tenzen-y commented 6 months ago

If WaitForWorkersReady is enabled, MPI operator and a gang scheduler would be stuck in a deadlock

@yzhao-2023 That's right, WaitForWorkersReady can potentially cause this deadlock.

But the doc should be updated to reflect this pitfall.

Anyway, we should add documentation about WaitForWorkersReady, since there isn't any documentation for this feature.

A better fix might be to change the default behavior to only create a pod group with N (N being worker pod count). Risking launcher not be started.

I don't want to add such defaulting, since users might be confused by the modified input value. I believe that validation would be better.

Does volcano offer an API to declare the size of the group beforehand?

@alculquicondor We can tell an arbitrary number to the volcano via PodGroup (runPolicy.minAvailable) here:

https://github.com/kubeflow/mpi-operator/blob/4a63d3cb35454d072c63fc84aeb5766878701ead/pkg/controller/podgroup.go#L130-L131
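For reference, the number from the linked code ends up in the PodGroup object that volcano consumes, in spec.minMember. A minimal illustrative fragment (the name is hypothetical; in practice the operator derives the PodGroup from the MPIJob):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: train-job    # hypothetical; derived from the owning MPIJob
spec:
  minMember: 4       # filled from runPolicy.minAvailable in the linked code
```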

alculquicondor commented 6 months ago

What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them. Otherwise there is no way for mpi-operator to prevent this "race", as volcano is expecting the Pods to be created.

tenzen-y commented 6 months ago

What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them. Otherwise there is no way for mpi-operator to prevent this "race", as volcano is expecting the Pods to be created.

Ah, I see. Yes, that's right. We don't have any way to tell volcano/scheduler-plugins the shape of the pods in advance. So I believe that validation would be worth it: users would not be able to create an MPIJob with waitForWorkersReady and minAvailable set to N, where N is the sum of all workers and the launcher.
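The proposed validation could be sketched as follows. This is a hypothetical standalone function, not the repo's actual webhook or validation code, and it assumes minAvailable and the worker replica count have already been extracted from the MPIJob spec:

```go
package main

import (
	"errors"
	"fmt"
)

// validateGangPolicy is a hypothetical sketch of the validation proposed above:
// reject an MPIJob that enables WaitForWorkersReady while minAvailable also
// counts the launcher, because the launcher is only created after the workers
// are ready, so a gang of workers+launcher can never be satisfied.
func validateGangPolicy(waitForWorkersReady bool, minAvailable, workerReplicas int32) error {
	if waitForWorkersReady && minAvailable > workerReplicas {
		return errors.New("minAvailable must not exceed the worker replica count when WaitForWorkersReady is enabled")
	}
	return nil
}

func main() {
	// minAvailable = workers + launcher (5 = 4 + 1) would deadlock, so it is rejected.
	fmt.Println(validateGangPolicy(true, 5, 4) != nil)
	// minAvailable = worker count only is accepted.
	fmt.Println(validateGangPolicy(true, 4, 4) == nil)
}
```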