Open wangqiaoshi opened 5 years ago
Would you please tell us what didn't work? Did you get some kind of error message? Did it just hang?
Thank you for your answer!
Yes, there were no error messages; it just hangs. The final message is:
`The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: tensorflow-benchmarks-gpu-worker-0
Local PID: 298
Peer hostname: (null) ([[49239,1],1])
Source IP of socket: 172.17.91.0
Known IPs of peer:
0.0.0.0
------------------`
I think it's also an ip-masquerade problem.
Yes, that's the problem with ip-masquerade. When this parameter is disabled, the solution can solve the problem perfectly.
When this parameter is disabled, the solution can solve the problem perfectly
Which parameter exactly?
Thank you for taking the time to submit an issue!
Background information
Image: mpioperator/tensorflow-benchmarks:latest
Kubernetes: v1.8.2
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.1.2
root@ab8c7f1afca5:/tensorflow/benchmarks# mpirun --version
mpirun.real (OpenRTE) 3.1.2
Report bugs to http://www.open-mpi.org/community/help/
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Please describe the system on which you are running
The situation is similar to the one described in https://github.com/open-mpi/ompi/issues/6103