kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Port conflicts will occur when multiple pods are dispatched to the same node under hostNetwork. #593

Open Saturnoul opened 1 year ago

Saturnoul commented 1 year ago

hostNetwork needs to be enabled to use RDMA for high-performance transmission. In that case, port conflicts will occur when multiple pods are dispatched to the same node, for the following reasons:

Can mpi-operator handle port conflicts under hostNetwork?
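
The conflict can be reproduced outside Kubernetes. Pods with hostNetwork enabled share the node's network namespace, so two pods listening on the same port behave like two sockets on one machine. The following is a minimal sketch of that situation, not code from mpi-operator; the function name `second_bind_fails` is made up for illustration.

```python
import socket


def second_bind_fails(host: str = "127.0.0.1") -> bool:
    """Bind one listener, then try to bind a second socket to the same port.

    Returns True if the second bind is rejected by the OS.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port as the first listener, as would happen
            # with two hostNetwork pods using the same port on one node.
            second.bind((host, port))
        except OSError:
            return True
        return False
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_fails())
```

Neither socket sets `SO_REUSEADDR` or `SO_REUSEPORT`, which is the default for most servers.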

alculquicondor commented 1 year ago

Why not have one pod per node when using hostNetwork?

It would be convoluted to dynamically generate a port for each pod and put that into the hostfile.
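
To give a sense of what per-pod port generation would involve, here is a rough sketch. It is hypothetical: mpi-operator does not do this, the names `assign_ports` and `render_ssh_config` are invented, and instead of the hostfile it writes the ports into an OpenSSH client config (`Host`, `HostName`, `Port`), which is an assumption about where the port would end up being consumed. Each pod's SSH daemon would also have to be started on its assigned port, which is not shown.

```python
from collections import defaultdict

BASE_PORT = 2222  # default port in the base image, per this thread


def assign_ports(pod_to_node: dict[str, str], base_port: int = BASE_PORT) -> dict[str, int]:
    """Give each pod a port that is unique among the pods on its node."""
    next_offset: dict[str, int] = defaultdict(int)
    ports: dict[str, int] = {}
    for pod in sorted(pod_to_node):  # sorted for a deterministic result
        node = pod_to_node[pod]
        ports[pod] = base_port + next_offset[node]
        next_offset[node] += 1
    return ports


def render_ssh_config(pod_to_node: dict[str, str], ports: dict[str, int]) -> str:
    """Render one OpenSSH client config entry per pod."""
    lines: list[str] = []
    for pod in sorted(ports):
        lines.append(f"Host {pod}")
        lines.append(f"    HostName {pod_to_node[pod]}")
        lines.append(f"    Port {ports[pod]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical placement: two workers share node-a.
    placement = {"worker-0": "node-a", "worker-1": "node-a", "worker-2": "node-b"}
    print(render_ssh_config(placement, assign_ports(placement)))
```

The sketch needs to know pod placement before ports can be chosen, which is part of why this is awkward to do in an operator: the scheduler decides placement after the pods are created.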

alculquicondor commented 1 year ago

FWIW, we use port 2222 by default in the base image, so that you can use hostNetwork without the pod's SSH daemon conflicting with the host's SSH daemon: https://github.com/kubeflow/mpi-operator/blob/6d713b48e3617c0bb2567fedca624787a6108818/Makefile#L22

tenzen-y commented 1 year ago

FYI: Instead of hostNetwork, you could use the SR-IOV device plugin and Multus CNI.

This approach works well in my production environment.