Closed tenzen-y closed 1 year ago
I suppose you mean creating and syncing a Workload object for each MPIJob.
Do you think we should have this code in mpi-operator or in kueue?
> I suppose you mean creating and syncing a Workload object for each MPIJob.
Yes, I meant we implement a controller for Workload resources like the job-controller. https://github.com/kubernetes-sigs/kueue/blob/10d35322c252c2724467cf4617e79e94e1bd0c8a/pkg/controller/workload/job/job_controller.go
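To make "creating and syncing a Workload object for each MPIJob" concrete, here is a rough sketch of what one reconcile step might look like. Everything in it is an assumption for illustration: the `MPIJob` and `Workload` structs are trimmed-down, hypothetical stand-ins (not the real mpi-operator or kueue API types), and the rule "keep the job suspended until its Workload is admitted" is my reading of how the linked job-controller behaves, not something confirmed in this thread.

```go
package main

import "fmt"

// MPIJob is a hypothetical, trimmed-down stand-in for the real MPIJob type.
type MPIJob struct {
	Namespace string
	Name      string
	QueueName string // hypothetical: the kueue queue this job targets
	Workers   int32
	Suspend   bool
}

// Workload is a hypothetical stand-in for kueue's Workload object.
type Workload struct {
	Namespace string
	Name      string
	QueueName string
	PodCount  int32
	Admitted  bool
}

// workloadFor builds the Workload that should exist for the given MPIJob.
func workloadFor(job MPIJob) Workload {
	return Workload{
		Namespace: job.Namespace,
		Name:      job.Name,
		QueueName: job.QueueName,
		PodCount:  job.Workers + 1, // workers plus one launcher (assumption)
	}
}

// reconcile performs one sync step for a job and its (possibly missing) Workload.
// If no Workload exists yet, it creates one and keeps the job suspended.
// Otherwise it mirrors the Workload's admission state onto the job.
func reconcile(job *MPIJob, existing *Workload) *Workload {
	if existing == nil {
		job.Suspend = true
		w := workloadFor(*job)
		return &w
	}
	job.Suspend = !existing.Admitted
	return existing
}

func main() {
	job := MPIJob{Namespace: "default", Name: "pi", QueueName: "main", Workers: 2}

	// First pass: no Workload yet, so one is created and the job stays suspended.
	w := reconcile(&job, nil)
	fmt.Printf("workload=%s/%s pods=%d suspended=%v\n", w.Namespace, w.Name, w.PodCount, job.Suspend)

	// Later pass: the queue admitted the Workload, so the job is unsuspended.
	w.Admitted = true
	reconcile(&job, w)
	fmt.Printf("admitted=%v suspended=%v\n", w.Admitted, job.Suspend)
}
```

A real controller would read and write these objects through the API server rather than in memory; the sketch only shows the shape of the sync logic.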
> Do you think we should have this code in mpi-operator or in kueue?
IIUC, kueue is designed not to hold third-party dependencies. So we might need to add that code into mpi-operator, right?
Yes, we could have this controller in either repo. Wherever it goes, we should have E2E tests for kueue+mpi-operator. Both repos already have the setup for E2E tests.
Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).
I personally prefer to have it in mpi-operator if the OWNERS don't mind (cc @terrytangyuan)
@ahg-g, wdyt?
I am okay with either approach as long as there are E2E tests.
If we put it in mpi-operator repo, I suppose it will run as a reconciler within the same binary, not yet another operator, correct?
Yes, same binary, possibly guarded by a command-line flag that enables it.
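As a rough illustration of "same binary, guarded by a flag", something along these lines could work. The flag name `enable-kueue-integration` and the reconciler types are hypothetical, made up for this sketch; they are not existing mpi-operator options.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// reconciler is a hypothetical minimal interface for this sketch.
type reconciler interface{ Name() string }

type mpiJobReconciler struct{}

func (mpiJobReconciler) Name() string { return "mpijob" }

type kueueWorkloadReconciler struct{}

func (kueueWorkloadReconciler) Name() string { return "kueue-workload" }

// buildReconcilers parses the command-line arguments and returns the
// reconcilers that should run inside this one binary.
func buildReconcilers(args []string) ([]reconciler, error) {
	fs := flag.NewFlagSet("mpi-operator", flag.ContinueOnError)
	// Hypothetical flag: off by default, so existing deployments are unaffected.
	enableKueue := fs.Bool("enable-kueue-integration", false,
		"run the kueue Workload reconciler alongside the MPIJob reconciler")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	rs := []reconciler{mpiJobReconciler{}}
	if *enableKueue {
		rs = append(rs, kueueWorkloadReconciler{})
	}
	return rs, nil
}

func main() {
	rs, err := buildReconcilers(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	for _, r := range rs {
		fmt.Println("starting reconciler:", r.Name())
	}
}
```

Defaulting the flag to off keeps the kueue-specific code path opt-in, which seems to match the "not yet another operator" intent.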
> Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).
I'm also fine either way.
If we host it in kueue, my concern is when kueue will stop serving that controller and donate it to the mpi-operator repo.
If kueue keeps serving that controller, we might face questions about why kueue does not provide controllers for other frameworks (e.g., Ray, Argo, Spark, and more).
However, if we host it in mpi-operator, the mpi-operator may have to follow kueue's API changes, since kueue is still in alpha, as you say.
We have maintainers in both sides, so I think we can manage :)
Great! +1 for having it in this repo.
Kueue would like to have a consistent integration model. We are grateful that MPI-operator is willing to add some Kueue-specific code, but other frameworks may not be as welcoming and may prefer to keep code that doesn't have to be in their repo outside of it. From that perspective, doing the non-critical integration (everything other than the suspend logic) outside of the frameworks sounds like a better option.
That sounds reasonable. I agree, @mwielgus. It isn't easy to convince every community to host a workload controller for kueue.
We can close this issue and work on the kueue side. However, I would like to keep this issue open for a while to hear what others think.
/close
@tenzen-y: Closing this issue.
Many users hope that mpi-operator v2 will support Kueue.
Blocked by #504 and https://github.com/kubernetes-sigs/kueue/issues/369
Kueue side issue: https://github.com/kubernetes-sigs/kueue/issues/65
/kind feature