kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Migrate v2 MPI operator to the unified operator #1479

Open terrytangyuan opened 2 years ago

terrytangyuan commented 2 years ago

Now that the v1 MPI operator has been migrated to this repo (https://github.com/kubeflow/training-operator/pull/1457), let's use this issue to track the progress on v2.

https://github.com/kubeflow/mpi-operator/tree/master/v2

cc @hackerboy01 @zw0610 @alculquicondor @kubeflow/wg-training-leads

andreyvelich commented 2 years ago

@alculquicondor What is the status of MPI Operator v2? Do we have plans to deliver MPI Operator v2 as part of the Universal Training Operator in Kubeflow 1.5? The Kubeflow 1.5 release deadline is January 15th.

alculquicondor commented 2 years ago

We need a contributor to do it. I don't currently have capacity to handle it, so it likely won't be possible by January 15th. But I don't think the v1 operator is ready either.

terrytangyuan commented 2 years ago

cc @ArangoGutierrez

johnugeorge commented 1 year ago

I want to resurrect this thread. There have been many asks from the community to have the v2 MPI operator in the training operator. Currently, newer features are merged into v2 MPI. Time has passed since the last discussion, and the v2 API is stable now. What is our plan here regarding migration? What are the roadblocks? There is also confusion in the community about the future of v1 MPI.

Can we prioritise this? @alculquicondor @terrytangyuan @tenzen-y

tenzen-y commented 1 year ago

IIRC, we are planning to donate mpi-operator v2 to kubernetes-sigs. So we should decide whether to donate it to kubernetes-sigs or merge the v2 operator into the training-operator, to avoid double management.

https://github.com/kubeflow/community/pull/557

cc: @ArangoGutierrez @denkensk @ahg-g

kuizhiqing commented 1 year ago

Do we have any new plan here? Since the donation of mpi-operator v2 to kubernetes-sigs seems to have been aborted, should we merge mpi-operator v2 into training-operator?

terrytangyuan commented 1 year ago

There's also discussion around donating the Spark-on-K8s project to Kubeflow (no open issue yet since we are still waiting for a governance update). I personally think that project is similar to MPI Operator in that it does not focus only on training. So I am not sure if MPI Operator would be a good fit for training-operator.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.