kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Expected ssh contract that must be followed by images to use this operator #580

Closed aavbsouza closed 1 year ago

aavbsouza commented 1 year ago

Hello. Is there a formal contract or documentation describing what this operator requires for SSH communication during the setup phase of a distributed job? For instance, the images built here (https://github.com/kubeflow/mpi-operator/tree/master/build/base) can communicate over SSH without root. What would be expected of a custom Docker image?

thanks

tenzen-y commented 1 year ago

There aren't any docs for custom images yet, but you can copy https://github.com/kubeflow/mpi-operator/blob/18250f5e6980ce3afbf86359f8aa5fa9ac6cf831/build/base/Dockerfile#L3-L31 into your Dockerfile.

Then you can add your own settings. In my environment this works well, even though I use nvcr.io/nvidia/pytorch:23.05-py3 as the base image.
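
For readers who want a rough picture before opening the link, here is a minimal sketch of the kind of SSH setup such a block typically performs on top of a custom base image. It is not a verbatim copy of the linked lines, which remain the authoritative version. The port value `2222`, the user name `mpiuser`, and the exact client and server options are assumptions made for illustration. The reasoning behind them is also an assumption: the SSH keys are taken to be mounted read-only by the operator, so host-key prompts, known-hosts writes, and strict permission checks need to be turned off.

```dockerfile
# Illustrative sketch only; compare against the linked Dockerfile before use.
FROM nvcr.io/nvidia/pytorch:23.05-py3

# Assumption: a non-privileged port, so sshd does not need to bind to port 22.
ARG port=2222

# Install the SSH server and client, and create sshd's runtime directory.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openssh-server openssh-client \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /var/run/sshd

# Client side: no interactive host-key confirmation and no writes to known_hosts.
RUN printf 'Host *\n    StrictHostKeyChecking no\n    UserKnownHostsFile /dev/null\n    Port %s\n' "$port" \
        >> /etc/ssh/ssh_config

# Server side: listen on the chosen port and skip file-permission checks
# on the user's .ssh directory (assumed to be mounted from outside the image).
RUN printf 'Port %s\nStrictModes no\n' "$port" >> /etc/ssh/sshd_config

# Assumption: a dedicated non-root user for running MPI processes.
RUN useradd -m mpiuser
WORKDIR /home/mpiuser
```

Running sshd itself as a non-root user usually needs further steps (for example a user-owned sshd configuration), which this sketch does not cover.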

tenzen-y commented 1 year ago

If you have any other questions, feel free to re-open this issue. /close

google-oss-prow[bot] commented 1 year ago

@tenzen-y: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/580#issuecomment-1668580521):

> If you have any other questions, feel free to re-open this issue.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.