kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
420 stars 211 forks source link

Support suspend semantics for MPIJob #511

Closed mimowo closed 1 year ago

mimowo commented 1 year ago

It solves: https://github.com/kubeflow/mpi-operator/issues/504

mimowo commented 1 year ago

@alculquicondor @tenzen-y WIP but ready for early feedback (would be good as this is my first PR in this repo). PTAL.

mimowo commented 1 year ago

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

I see, but is common reused by other subprojects too, right? So we would also need to copy the contents of common into these repos. Sounds like a lot of work, maybe simple, but the diffs will be big and one needs to be careful, so not sure we want to block the suspend work on that? Also, is this effort already planned, or in progress @alculquicondor @tenzen-y ?

tenzen-y commented 1 year ago

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

I see, but is common reused by other subprojects too, right? So we would also need to copy the contents of common into these repos. Sounds like a lot of work, maybe simple, but the diffs will be big and one needs to be careful, so not sure we want to block the suspend work on that? Also, is this effort already planned, or in progress @alculquicondor @tenzen-y ?

Yes, that's right. We are using common repo in training-operator. However, we are planning to consolidate common codes to the training-operator repo.

https://github.com/kubeflow/training-operator/issues/1714

tenzen-y commented 1 year ago

FYI, the common repo is on its way to disappearing. I would suggest copying the RunPolicy struct here and adding the field.

@alculquicondor I agree with adding a suspend member to Runpolicy. Although can we copy RunPolicy in a separate PR? Since I think copying the Runpolicy to this repo is another context with this PR.

mimowo commented 1 year ago

@alculquicondor I agree with adding a suspend member to Runpolicy. Although can we copy RunPolicy in a separate PR? Since I think copying the Runpolicy to this repo is another context with this PR.

What about the other constants, like the once defining conditions? I guess we could have a PR to just copy RunPolicy to mpi-operator to unblock this work, but keep the dependency on common@0.4.6 for the condition constants. Then, we can extend the set of MPIJob conditions by JobSuspended just in the mpi-operator. If this sounds good I can open a preparatory PR just to copy RunPolicy.

tenzen-y commented 1 year ago

@alculquicondor I agree with adding a suspend member to Runpolicy. Although can we copy RunPolicy in a separate PR? Since I think copying the Runpolicy to this repo is another context with this PR.

What about the other constants, like the once defining conditions? I guess we could have a PR to just copy RunPolicy to mpi-operator to unblock this work, but keep the dependency on common@0.4.6 for the condition constants. Then, we can extend the set of MPIJob conditions by JobSuspended just in the mpi-operator. If this sounds good I can open a preparatory PR just to copy RunPolicy.

Sounds good to me. Although, let me know what other members think.

cc @alculquicondor @terrytangyuan

alculquicondor commented 1 year ago

sgtm

terrytangyuan commented 1 year ago

Sounds good

mimowo commented 1 year ago

@tenzen-y @alculquicondor I've opened the preparatory PR here: https://github.com/kubeflow/mpi-operator/pull/513. Please review.

mimowo commented 1 year ago

@terrytangyuan Please approve CI

mimowo commented 1 year ago

@alculquicondor @tenzen-y I fixed the issues reported so far and added tests (integration and e2e), so moving it out of WIP status. Please review.

tenzen-y commented 1 year ago

@mimowo Thanks for the great work! /lgtm

/assign @alculquicondor

mimowo commented 1 year ago

Can you add a unit test?

Done, ended up adding 3 actually: for creating suspended MPIJob, suspending if running and resuming. One thing is that to write the unit test for resuming I had to refactor the code a little bit to inject a fake clock in tests. Also, the test for suspending a running MPIJob revealed that I was requiring two syncs - one to clean up the pod workers and one to update the MPIJob status. Now, I do these steps in one sync.

Also, I'm not convinced of the value of an E2E test over unit+integration. Do you have a particular justification?

I think of two reasons:

  1. the test abstracts out the implementation details, thus is better at documenting what the feature is about. For example, it abstracts out when the service and other auxiliary objects are created. Thus, having such a test allows us to do refactoring and have confidence that the feature works before the unit, or integration tests are adjusted.
  2. it makes us more confident about race conditions. For example, suspending or resuming an MPIJob triggers suspending or resuming the Launcher Job which happens asynchronously. We don't really test the interaction between the launcher job and the MPIJob at other layers of testing.
alculquicondor commented 1 year ago

/lgtm /assign @terrytangyuan

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [terrytangyuan] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
terrytangyuan commented 1 year ago

The other PR got merged first so this one will need to resolve conflicts :-)

terrytangyuan commented 1 year ago

/lgtm

alculquicondor commented 1 year ago

Still lgtm