kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Create MPI Runtime #2217

Open andreyvelich opened 3 months ago

andreyvelich commented 3 months ago

Related: https://github.com/kubeflow/training-operator/issues/2170

As part of this KEP, we will migrate to the MPI V2 implementation.

We should add support for the MPI Runtime.

/area runtime

tenzen-y commented 3 months ago

Note that we need to extend KEP-2170 to cover MPI before we implement anything.

tenzen-y commented 3 months ago

Oh, we already added the MPI design here: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#the-mpi-spec-api

NVM

andreyvelich commented 3 months ago

Once we are ready to implement the MPI runtime, we should probably update this ClusterTrainingRuntime: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#mpi-runtime

It might have incorrect values, since we didn't get a chance to finalize it.

tenzen-y commented 3 months ago

That sounds good to me.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 1 week ago

/remove-lifecycle stale