Open alculquicondor opened 3 years ago
https://www.kubeflow.org/docs/about/contributing/#joining-the-kubeflow-github-org
Hi, could you please join the kubeflow org? Then we won't need to trigger CI/CD for your PRs manually.
Sent PR kubeflow/internal-acls#473
Thanks for the suggestion
I verified that the images docker.io/kubeflow/mpi-horovod-mnist and docker.io/mpioperator/tensorflow-benchmarks just work with the new controller. Marking that as done.
@alculquicondor Has the community discussed the tradeoffs of Job vs. Pod for the launcher, and StatefulSets vs. plain Pods for the workers?
Yes for launcher. See the discussion here #386
For workers, it's still open for discussion. We could use StatefulSets, but I think plain pods might be fine for now. We might migrate to Indexed Jobs at some point, but since they're only available in k8s 1.22, it's kind of early to discuss.
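To make the tradeoff a bit more concrete: whichever option is used for workers (plain pods, StatefulSets, or Indexed Jobs), the launcher needs a predictable hostname for each worker. Here is a minimal sketch of building an Open MPI style hostfile from index-based names. The `<job>-worker-<index>` pod naming, the `<job>-worker` headless Service name, and the `worker_hostfile` helper are assumptions for illustration only, not necessarily what the controller in this PR does.

```python
def worker_hostfile(job_name: str, namespace: str, replicas: int, slots: int = 1) -> str:
    """Return hostfile text with one "<host> slots=<n>" line per worker.

    Assumes (hypothetically) that worker i is reachable at
    <job_name>-worker-<i>.<job_name>-worker.<namespace>.svc
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if slots < 1:
        raise ValueError("slots must be at least 1")

    # Hypothetical headless Service that gives each worker pod a DNS record.
    service = f"{job_name}-worker"

    lines = [
        f"{job_name}-worker-{i}.{service}.{namespace}.svc slots={slots}"
        for i in range(replicas)
    ]
    # End with a trailing newline when there is at least one worker.
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    print(worker_hostfile("pi", "default", 2), end="")
```

With plain pods the controller has to assign these indexed names itself, while StatefulSets and Indexed Jobs would provide an ordinal per pod. Either way, the launcher only depends on the names being stable.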
I think this is pretty much ready. The last things I would like to do are:
* Add documentation (is there a website, or should I just do it in the READMEs?)
There's this page https://www.kubeflow.org/docs/components/training/mpi/
Maybe we can introduce Indexed Jobs to mpi-operator v2 once https://github.com/kubernetes/enhancements/issues/3715 graduates to beta.
Consider introducing JobSet instead of managing raw pods for the workers: https://github.com/kubernetes-sigs/jobset
Implementation for https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md