kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Adapt mpi-operator v2 to Kueue #505

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

Many users hope that mpi-operator v2 will integrate with Kueue.

Blocked by #504 and https://github.com/kubernetes-sigs/kueue/issues/369
Kueue-side issue: https://github.com/kubernetes-sigs/kueue/issues/65

/kind feature

alculquicondor commented 1 year ago

I suppose you mean creating and syncing a Workload object for each MPIJob.

Do you think we should have this code in mpi-operator or in kueue?

tenzen-y commented 1 year ago

> I suppose you mean creating and syncing a Workload object for each MPIJob.

Yes, I meant that we would implement a controller for Workload resources, like the job-controller: https://github.com/kubernetes-sigs/kueue/blob/10d35322c252c2724467cf4617e79e94e1bd0c8a/pkg/controller/workload/job/job_controller.go

> Do you think we should have this code in mpi-operator or in kueue?

IIUC, kueue is designed not to take on third-party dependencies, so we might need to add that code to mpi-operator, right?
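The controller being discussed could be sketched roughly as below. This is a minimal, simplified illustration of the create-and-sync idea, with hypothetical `MPIJob` and `Workload` structs standing in for the real API types; an actual implementation would use controller-runtime and the kueue Workload CRD instead:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real mpi-operator and kueue types.
type MPIJob struct {
	Name      string
	Suspended bool
}

type Workload struct {
	OwnerJob string
	Admitted bool
}

// reconcile mirrors the idea of kueue's job_controller: ensure a Workload
// exists for the MPIJob, and keep the job's suspend state in sync with
// the Workload's admission status.
func reconcile(job *MPIJob, workloads map[string]*Workload) {
	wl, ok := workloads[job.Name]
	if !ok {
		// No Workload yet: create one and keep the job suspended
		// until kueue admits it.
		workloads[job.Name] = &Workload{OwnerJob: job.Name}
		job.Suspended = true
		return
	}
	// Unsuspend the job once its Workload has been admitted.
	job.Suspended = !wl.Admitted
}

func main() {
	workloads := map[string]*Workload{}
	job := &MPIJob{Name: "pi"}

	reconcile(job, workloads) // creates the Workload, suspends the job
	fmt.Println(job.Suspended)

	workloads["pi"].Admitted = true // kueue admits the workload
	reconcile(job, workloads)
	fmt.Println(job.Suspended)
}
```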

alculquicondor commented 1 year ago

Yes, we could have this controller in either repo. Wherever it goes, we should have E2E tests covering kueue + mpi-operator; both repos already have the setup for E2E tests.

Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).

I personally prefer to have it in mpi-operator if the OWNERS don't mind (cc @terrytangyuan)

@ahg-g, wdyt?

terrytangyuan commented 1 year ago

I am okay with either approach as long as there are E2E tests.

ahg-g commented 1 year ago

If we put it in mpi-operator repo, I suppose it will run as a reconciler within the same binary, not yet another operator, correct?

alculquicondor commented 1 year ago

Yes, the same binary, possibly guarded by a command-line flag to enable it.

tenzen-y commented 1 year ago

> Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).

I'm also fine either way.

If we select the latter, my concern is when kueue would stop serving that controller and donate it to the mpi-operator repo.

If kueue keeps serving that controller, we might face questions about why kueue does not provide controllers for other frameworks (e.g., Ray, Argo, Spark, and more).

However, if we select the former, mpi-operator may have to track kueue API changes, since kueue is still in alpha, as you say.

alculquicondor commented 1 year ago

We have maintainers on both sides, so I think we can manage :)

tenzen-y commented 1 year ago

Great! +1 for having it in this repo.

mwielgus commented 1 year ago

Kueue would like to have a consistent integration model. We are grateful that mpi-operator is willing to add some Kueue-specific code, but other frameworks may not be as welcoming and may prefer to keep code that doesn't have to be in their repo outside of it. From that perspective, doing the non-critical integration (everything other than the suspend logic) outside of the frameworks sounds like a better option.

tenzen-y commented 1 year ago

That sounds reasonable. I agree, @mwielgus. It isn't easy to convince every community to host a workload controller for kueue.

tenzen-y commented 1 year ago

We can close this issue and work on the kueue side. However, I would like to keep this issue open for a while to hear what others think.

tenzen-y commented 1 year ago

/close

google-oss-prow[bot] commented 1 year ago

@tenzen-y: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/505#issuecomment-1434892413):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.