Closed mimowo closed 1 year ago
@alculquicondor @tenzen-y WIP but ready for early feedback (would be good as this is my first PR in this repo). PTAL.
FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.
I see, but common is reused by other subprojects too, right? So we would also need to copy the contents of common into those repos. That sounds like a lot of work. It may be simple, but the diffs will be big and need care, so I'm not sure we want to block the suspend work on it. Also, is this effort already planned or in progress, @alculquicondor @tenzen-y?
Yes, that's right. We are using the common repo in training-operator. However, we are planning to consolidate the common code into the training-operator repo.
@alculquicondor I agree with adding a suspend member to RunPolicy. Although, can we copy RunPolicy in a separate PR? I think copying RunPolicy to this repo is a separate concern from this PR.
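For illustration, here is a minimal sketch of what a locally copied RunPolicy with a suspend member could look like. This is not the actual struct from common or from this PR: the BackoffLimit field is only a stand-in for the existing fields, the JSON tags are assumptions, and IsSuspended is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunPolicy is a trimmed, hypothetical local copy of the struct that
// currently lives in the common repo. Only the Suspend field is the
// subject of this discussion; BackoffLimit is an illustrative stand-in
// for the fields that would be copied over unchanged.
type RunPolicy struct {
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Suspend asks the controller not to create pods for the job, or to
	// remove them if the job is already running. A pointer is used so
	// that "unset" can be told apart from an explicit false.
	Suspend *bool `json:"suspend,omitempty"`
}

// IsSuspended is a hypothetical helper: a nil pointer means "not suspended".
func IsSuspended(rp RunPolicy) bool {
	return rp.Suspend != nil && *rp.Suspend
}

func main() {
	suspend := true
	rp := RunPolicy{Suspend: &suspend}

	out, err := json.Marshal(rp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out), IsSuspended(rp))
}
```

Using a pointer with omitempty keeps existing manifests valid, since a job that never sets the field serializes exactly as before.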
What about the other constants, like the ones defining conditions? I guess we could have a PR that just copies RunPolicy to mpi-operator to unblock this work, but keeps the dependency on common@0.4.6 for the condition constants. Then we can extend the set of MPIJob conditions with JobSuspended in mpi-operator only. If this sounds good, I can open a preparatory PR just to copy RunPolicy.
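A rough sketch of that idea, extending the condition set locally with JobSuspended, might look like the following. Everything here is an assumption made for illustration: the types are local stand-ins rather than the ones imported from common, the string values and the reason are made up, and setCondition is a hypothetical helper.

```go
package main

import "fmt"

// JobConditionType stands in for the condition type that would keep
// coming from common; it is redefined here only so the sketch compiles
// on its own.
type JobConditionType string

const (
	// Stand-ins for constants that would still be imported from common.
	JobCreated JobConditionType = "Created"
	JobRunning JobConditionType = "Running"

	// JobSuspended would be defined in mpi-operator only.
	// The string value is an assumption.
	JobSuspended JobConditionType = "Suspended"
)

// JobCondition is a simplified stand-in for a status condition.
type JobCondition struct {
	Type   JobConditionType
	Status bool
	Reason string
}

// setCondition replaces the condition of the same type, or appends it if
// absent. It reports whether the slice actually changed, so a caller can
// skip a status update when nothing is different.
func setCondition(conds []JobCondition, c JobCondition) ([]JobCondition, bool) {
	for i := range conds {
		if conds[i].Type == c.Type {
			if conds[i] == c {
				return conds, false
			}
			conds[i] = c
			return conds, true
		}
	}
	return append(conds, c), true
}

func main() {
	var conds []JobCondition
	conds, _ = setCondition(conds, JobCondition{Type: JobCreated, Status: true})
	conds, changed := setCondition(conds, JobCondition{
		Type:   JobSuspended,
		Status: true,
		Reason: "MPIJobSuspended", // hypothetical reason string
	})
	fmt.Println(conds, changed)
}
```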
Sounds good to me. Although, let me know what other members think.
cc @alculquicondor @terrytangyuan
sgtm
Sounds good
@tenzen-y @alculquicondor I've opened the preparatory PR here: https://github.com/kubeflow/mpi-operator/pull/513. Please review.
@terrytangyuan Please approve CI
@alculquicondor @tenzen-y I fixed the issues reported so far and added tests (integration and e2e), so moving it out of WIP status. Please review.
@mimowo Thanks for the great work! /lgtm
/assign @alculquicondor
Can you add a unit test?
Done, I ended up adding 3 actually: one for creating a suspended MPIJob, one for suspending a running MPIJob, and one for resuming. To write the unit test for resuming, I had to refactor the code a little to inject a fake clock in tests. Also, the test for suspending a running MPIJob revealed that I was requiring two syncs: one to clean up the pod workers and one to update the MPIJob status. Now I do both steps in one sync.
Also, I'm not convinced of the value of an E2E test over unit+integration. Do you have a particular justification?
I can think of two reasons:
/lgtm /assign @terrytangyuan
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: terrytangyuan
The other PR got merged first so this one will need to resolve conflicts :-)
/lgtm
Still lgtm
It solves: https://github.com/kubeflow/mpi-operator/issues/504