open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

horovod distributed failed run in k8s with kube-router #6447

Open wangqiaoshi opened 5 years ago

wangqiaoshi commented 5 years ago


Background information

image: mpioperator/tensorflow-benchmarks:latest
k8s: v1.8.2

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.2

root@ab8c7f1afca5:/tensorflow/benchmarks# mpirun --version
mpirun.real (OpenRTE) 3.1.2

Report bugs to http://www.open-mpi.org/community/help/

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running


The situation is similar to this https://github.com/open-mpi/ompi/issues/6103

rhc54 commented 5 years ago

Would you please tell us what didn't work? Did you get some kind of error message? Did it just hang?

wangqiaoshi commented 5 years ago

> Would you please tell us what didn't work? Did you get some kind of error message? Did it just hang?

Thank you for your answer!
Yes, there are no error messages; it just hangs. The final message is:

`The inbound connection has been dropped, and the peer should simply try again with a different IP interface (i.e., the job should hopefully be able to continue).

Local host: tensorflow-benchmarks-gpu-worker-0
Local PID: 298
Peer hostname: (null) ([[49239,1],1])
Source IP of socket: 172.17.91.0
Known IPs of peer: 0.0.0.0
------------------`

I think it's also an ip-masquerade problem.
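
To illustrate why masquerading could lead to a message like the one above, here is a minimal sketch. It assumes, as the message text suggests, that the receiver compares the source address of the incoming socket with the addresses the peer advertised. This is not Open MPI's actual code: the function names and the pod address `10.244.1.5` are hypothetical, and only `172.17.91.0` comes from the log.

```python
import ipaddress


def accept_inbound(source_ip: str, known_peer_ips: list[str]) -> bool:
    """Return True if the socket's source address is one the peer advertised."""
    src = ipaddress.ip_address(source_ip)
    return any(src == ipaddress.ip_address(ip) for ip in known_peer_ips)


def seen_source_ip(pod_ip: str, node_side_ip: str, masquerade: bool) -> str:
    """Hypothetical model of SNAT: with masquerading enabled, the receiver
    sees a node-side address instead of the sender pod's own address."""
    return node_side_ip if masquerade else pod_ip


pod_ip = "10.244.1.5"         # hypothetical address the peer advertised
node_side_ip = "172.17.91.0"  # "Source IP of socket" from the log above
known_ips = [pod_ip]

# With masquerading, the rewritten source address matches nothing the peer
# advertised, so the connection would be rejected.
dropped = not accept_inbound(seen_source_ip(pod_ip, node_side_ip, True), known_ips)

# Without masquerading, the source address is the pod's own and matches.
accepted = accept_inbound(seen_source_ip(pod_ip, node_side_ip, False), known_ips)

print(f"masquerade on:  dropped={dropped}")
print(f"masquerade off: accepted={accepted}")
```

If this model is right, the hang would follow from every retry arriving with the same rewritten source address.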

houseonline commented 5 years ago

Yes, this is an ip-masquerade problem. When this parameter is disabled, the solution can solve the problem perfectly.

KansaiTraining commented 2 days ago

> When this parameter is disabled, the solution can solve the problem perfectly

Which parameter exactly?