kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Implement v2 controller that sets up SSH for communication #373

Open alculquicondor opened 3 years ago

alculquicondor commented 3 years ago

Implementation for https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md

gaocegege commented 3 years ago

https://www.kubeflow.org/docs/about/contributing/#joining-the-kubeflow-github-org

Hi, could you please join the kubeflow org? Then we won't need to trigger CI/CD for your PR manually.

alculquicondor commented 3 years ago

Sent PR kubeflow/internal-acls#473

Thanks for the suggestion

alculquicondor commented 3 years ago

I verified that the images docker.io/kubeflow/mpi-horovod-mnist and docker.io/mpioperator/tensorflow-benchmarks just work with the new controller. Marking that as done.

Jeffwan commented 3 years ago

@alculquicondor Has the community discussed the tradeoffs of Job vs. Pod for the launcher, and StatefulSets vs. plain Pods for the workers?

alculquicondor commented 3 years ago

Yes for the launcher. See the discussion in #386.

For workers, it's still open for discussion. We could use StatefulSets, but I think plain Pods might be fine for now. We might migrate to Indexed Jobs at some point, but since they're only available in k8s 1.22, it's kind of early to discuss.
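
To make the Indexed Job option concrete, here is a minimal sketch of what a worker Job could look like, written as a Python dict that serializes to a manifest. It is only an illustration, not what the controller does: the function names, the Job name, and the image are hypothetical. The `completionMode`, `completions`, and `parallelism` fields are the standard `batch/v1` Job fields. The relevant property for MPI is that each Pod of an Indexed Job gets a stable hostname of the form `<job-name>-<index>`, which is roughly the identity that plain Pods or a StatefulSet would otherwise have to provide.

```python
import json


def indexed_worker_job(name: str, image: str, workers: int) -> dict:
    """Build a batch/v1 Job manifest that uses Indexed completion mode.

    `name`, `image`, and this helper itself are hypothetical; only the
    manifest field names come from the Kubernetes Job API.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # One completion index per worker, all running at the same time.
            "completionMode": "Indexed",
            "completions": workers,
            "parallelism": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                },
            },
        },
    }


def worker_hostnames(name: str, workers: int) -> list:
    """Pod hostnames an Indexed Job assigns: <job-name>-<index>."""
    return [f"{name}-{i}" for i in range(workers)]


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = indexed_worker_job("demo-worker", "example.com/mpi-worker:latest", 2)
    print(json.dumps(manifest, indent=2))
    print(worker_hostnames("demo-worker", 2))
```

Resolving those hostnames from the launcher would additionally need a headless Service and a matching `subdomain` on the Pod template, which this sketch leaves out.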

alculquicondor commented 3 years ago

I think this is pretty much ready. The last things I would like to do are:

terrytangyuan commented 3 years ago

* Add documentation (is there a website, or should I just do it on readmes)?

There's this page https://www.kubeflow.org/docs/components/training/mpi/

tenzen-y commented 1 year ago

Maybe we can introduce Indexed Jobs to mpi-operator v2 once https://github.com/kubernetes/enhancements/issues/3715 graduates to beta.

tenzen-y commented 1 year ago

Consider introducing JobSet instead of managing raw pods for the workers: https://github.com/kubernetes-sigs/jobset