Open wang-mask opened 5 months ago
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Then this judgment condition will be meet, and the launcher pod will be created.
But we only create the launcher Job suspended.
Are you saying that the problem arises when you unsuspend the job and suddenly all the pods start?
Not sure if it's worth always creating the Job and only unsuspending if the workers are ready.
I think the launcher pod should be created after all the workers are ready when the spec.launcherCreationPolicy
is set to "WaitForWorkersReady"
, even if the MPIJob is suspend.
The current logic is not very consistent with "WaitForWorkersReady"
.
I think the launcher pod should be created after all the workers are ready when the spec.launcherCreationPolicy is set to "WaitForWorkersReady", even if the MPIJob is suspend.
I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?
I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?
In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady". Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?
My understanding of spec.launcherCreationPolicy is set to "WaitForWorkersReady"
is to wait for all workers to be ready before creating a launcher. So in a suspended state, the workers have not been created and also are not ready, the launcher should not be created.
The current code is designed to create a launcher, but it suspends it when the MPI job is suspended, regardless of the .launcherCreationPolicy. Does this meet expectations?
I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?
In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".
Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?
In this situation, who does create the workers?
I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?
In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady". Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?
In this situation, who does create the workers?
The operation of unsuspending an MPI job triggers the creation of workers.
With this implementation: what happens if the job is running, then it is suspended and unsuspended?
Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.
With this implementation: what happens if the job is running, then it is suspended and unsuspended?
Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.
Now the controller deletes the worker pod and suspends the launcher when the mpi job is suspended.
Maybe it can also deletes the launcher when the mpi job is suspended and spec.launcherCreationPolicy = "WaitForWorkersReady"
. After second or subsequent suspension was lifted, the controller waits for all workers to be ready before creating a launcher.
If this implementation is reasonable, I would be happy to modify this PR.
Or you could still create the Job object but not flip the suspend flag until the workers are ready.
fixes #615
The original logic is:
If the MPIJob is
suspended
, thelen(worker)
andc.countReadyWorkerPods(worker)
would be both0
. Then this judgment condition will be meet, and the launcher pod will be created.