open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

openMPI dropped inbound connection #12918

Open KansaiTraining opened 3 days ago

KansaiTraining commented 3 days ago

I have found a couple of issues that seem similar to this one, but I can't tell whether they have been solved or how they apply to my situation.

I am running Slurm jobs with srun and Open MPI. When I run a job on a single node it completes (with some warnings), but when I run it on two nodes I get:

[5A301-0407-G5500-12:89116] btl: tcp: attempting to connect() to [[62864,0],0] address 10.3.29.82 on port 1031
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          5A301-0407-G5500-11
  Local PID:           49564
  Peer hostname:       5A301-0407-G5500-12 ([[62864,0],8])
  Source IP of socket: 10.3.29.53
  Known IPs of peer:   a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::
--------------------------------------------------------------------------

I investigated and it seems node 11 cannot communicate with node 12. One thing that bugs me is that I don't know what a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d:: are (yes, they look like IPv6), since: 1) similar errors reported online usually list alternative IPv4 addresses here, and 2) these IPv6 addresses don't show up anywhere in the output of ip addr.
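One way to probe the odd "Known IPs of peer" values is to read the first two hextets of each entry as the four bytes of an IPv4 address. This is only a guess about how the values were printed; nothing in the thread confirms it. The helper name `leading_hextets_as_ipv4` is made up for this sketch.

```python
import ipaddress

# Values copied verbatim from the "Known IPs of peer" line of the error message.
KNOWN_IPS_OF_PEER = [
    "a03:190d::a03:190d::",
    "a03:1871::a03:1871::",
    "a03:1d53::a03:1d53::",
    "a03:1d0d::a03:1d0d::",
]


def leading_hextets_as_ipv4(entry: str) -> str:
    """Interpret the first two hextets of `entry` as a 32-bit IPv4 address.

    Assumption: the printed value is really IPv4 bytes shown in hex.
    """
    first_group = entry.split("::")[0]           # e.g. "a03:190d"
    high, low = (int(part, 16) for part in first_group.split(":"))
    return str(ipaddress.IPv4Address((high << 16) | low))


decoded = [leading_hextets_as_ipv4(entry) for entry in KNOWN_IPS_OF_PEER]
for entry, address in zip(KNOWN_IPS_OF_PEER, decoded):
    print(f"{entry} -> {address}")
```

Under this reading the entries decode to addresses in 10.3.24.x, 10.3.25.x and 10.3.29.x, none of which equals the socket's source IP 10.3.29.53. Comparing the decoded values against the output of ip addr on Node 12 would show whether the guess holds.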

I investigated further: 10.3.29.53 is Node 12's 25G RoCEv2 network interface, and 10.3.29.82 (the address in the verbose log above) is Node 11's 25G RoCEv2 control network interface.

Another thing that confuses me: the log says Node 12 is attempting to connect to Node 11's RoCEv2 control network, but the error message seems to say the opposite, that Node 11 is trying to connect to Node 12 from an unexpected IP.

I have tried setting OMPI_MCA_btl_tcp_if_include to various values. Only once did the error disappear, but the process then hung. I am at a loss as to how to proceed.

bosilca commented 3 days ago

As the error message tries to explain, peer 5A301-0407-G5500-12 has only published a set of IPv6 addresses, but it is trying to initiate a connection via IPv4. Open MPI drops the connection because the source address is not in the list of known addresses for that peer.

It is definitely node 12 trying to connect to node 11. Looking at the code, this message is generated in the accept path, so on node 11.

Are your interfaces correctly set up on both nodes? Are the routing tables correct? You can try disabling IPv6 and then limiting the traffic to the usual network (I'm not sure what you mean by the control network in your setup).
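The check described above can be pictured with a small Python sketch. It is an illustration of the idea only, not Open MPI's actual implementation; the function name `source_is_known` and the example IPv6 address are invented for the sketch.

```python
import ipaddress


def source_is_known(source_ip: str, published_ips: list[str]) -> bool:
    """Return True if the socket's source address is one the peer published.

    Hypothetical helper: the receiver compares the inbound source address
    with the addresses the peer advertised and drops the connection on a miss.
    """
    source = ipaddress.ip_address(source_ip)
    published = {ipaddress.ip_address(address) for address in published_ips}
    return source in published


# An IPv4 source can never match a list that holds only IPv6 addresses.
# "fd00::1" is a made-up address used purely for illustration.
print(source_is_known("10.3.29.53", ["fd00::1"]))      # False -> dropped
print(source_is_known("10.3.29.53", ["10.3.29.53"]))   # True  -> accepted
```

The point of the sketch is that the decision is made on the receiving side, which is why the message appears on node 11 even though node 12 initiated the connection.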
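One possible way to try this suggestion from a launcher script is sketched below. It builds an environment that restricts the TCP BTL to one subnet and turns off IPv6 inside the TCP BTL. Several things here are assumptions: the subnet 10.3.29.0/24 is guessed from the addresses in the report (the real netmask is not given), `btl_tcp_disable_family` should be confirmed for your Open MPI version with `ompi_info --param btl tcp --level 9`, the suggestion may equally have meant disabling IPv6 at the OS level, and `./my_mpi_app` and the helper `ompi_tcp_env` are placeholders.

```python
import os


def ompi_tcp_env(subnet, disable_ipv6=True, base=None):
    """Return a copy of the environment with Open MPI TCP BTL settings added.

    Hypothetical helper. `subnet` is an interface name or CIDR block, the
    same kind of value OMPI_MCA_btl_tcp_if_include already takes.
    """
    env = dict(os.environ if base is None else base)
    # Restrict MPI TCP traffic to the chosen network.
    env["OMPI_MCA_btl_tcp_if_include"] = subnet
    if disable_ipv6:
        # Assumption: value 6 disables the IPv6 family in the TCP BTL.
        env["OMPI_MCA_btl_tcp_disable_family"] = "6"
    return env


env = ompi_tcp_env("10.3.29.0/24")  # assumed subnet, adjust to your network
print(env["OMPI_MCA_btl_tcp_if_include"])

# Example launch, left commented out because it needs a Slurm allocation:
# import subprocess
# subprocess.run(["srun", "-N", "2", "./my_mpi_app"], env=env, check=True)
```

Both nodes need an interface in the chosen subnet; if either node lacks one, the job is likely to hang or fail to connect rather than fall back to another network.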