kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
417 stars 209 forks source link

fix the condition #617

Open wang-mask opened 5 months ago

wang-mask commented 5 months ago

fixes #615

The original logic is:

if launcher == nil {
    if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup ||  c.countReadyWorkerPods(worker) == len(worker) {
        launcher, err = c.kubeClient.BatchV1().Jobs(namespace).Create(context.TODO(), c.newLauncherJob(mpiJob), metav1.CreateOptions{})
        ....
}

If the MPIJob is suspended, the len(worker) and c.countReadyWorkerPods(worker) would be both 0. Then this judgment condition will be meet, and the launcher pod will be created.

google-cla[bot] commented 5 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-oss-prow[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
alculquicondor commented 5 months ago

Then this judgment condition will be meet, and the launcher pod will be created.

But we only create the launcher Job suspended.

Are you saying that the problem arises when you unsuspend the job and suddenly all the pods start?

Not sure if it's worth always creating the Job and only unsuspending if the workers are ready.

wang-mask commented 5 months ago

I think the launcher pod should be created after all the workers are ready when the spec.launcherCreationPolicy is set to "WaitForWorkersReady", even if the MPIJob is suspend. The current logic is not very consistent with "WaitForWorkersReady".

tenzen-y commented 5 months ago

I think the launcher pod should be created after all the workers are ready when the spec.launcherCreationPolicy is set to "WaitForWorkersReady", even if the MPIJob is suspend.

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

wang-mask commented 5 months ago

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady". Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

wang-mask commented 5 months ago

My understanding of spec.launcherCreationPolicy is set to "WaitForWorkersReady" is to wait for all workers to be ready before creating a launcher. So in a suspended state, the workers have not been created and also are not ready, the launcher should not be created.

The current code is designed to create a launcher, but it suspends it when the MPI job is suspended, regardless of the .launcherCreationPolicy. Does this meet expectations?

tenzen-y commented 5 months ago

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".

Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

In this situation, who does create the workers?

wang-mask commented 5 months ago

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady". Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

In this situation, who does create the workers?

The operation of unsuspending an MPI job triggers the creation of workers.

alculquicondor commented 5 months ago

With this implementation: what happens if the job is running, then it is suspended and unsuspended?

Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.

wang-mask commented 5 months ago

With this implementation: what happens if the job is running, then it is suspended and unsuspended?

Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.

Now the controller deletes the worker pod and suspends the launcher when the mpi job is suspended.

Maybe it can also deletes the launcher when the mpi job is suspended and spec.launcherCreationPolicy = "WaitForWorkersReady". After second or subsequent suspension was lifted, the controller waits for all workers to be ready before creating a launcher.

If this implementation is reasonable, I would be happy to modify this PR.

alculquicondor commented 5 months ago

Or you could still create the Job object but not flip the suspend flag until the workers are ready.