Open KansaiTraining opened 3 days ago
As the error message tries to explain, peer 5A301-0407-G5500-12 has only published a set of IPv6 addresses, but it is trying to initiate a connection via IPv4. OMPI drops the connection because the source address is not in the list of addresses known for that peer.
It is definitely node 12 trying to connect to node 11. Looking at the code, this message is generated in the accept code, so on node 11.
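To make the check described above concrete, here is a minimal Python sketch of the idea: the accepting side compares the source address of an incoming connection against the addresses the peer published. This is only an illustration, not Open MPI's actual code; the function name `is_known_source` and the sample IPv6 addresses (taken from the documentation prefix 2001:db8::/32) are made up for the example.

```python
import ipaddress

def is_known_source(source, published):
    """Return True if `source` equals one of the addresses the peer published."""
    src = ipaddress.ip_address(source)
    return any(src == ipaddress.ip_address(addr) for addr in published)

# Hypothetical peer record: only IPv6 addresses were published.
published = ["2001:db8::a", "2001:db8::b"]

# An IPv4 source can never match an IPv6-only list, so it would be rejected.
print(is_known_source("10.3.29.53", published))   # False
print(is_known_source("2001:db8::a", published))  # True
```

The point is that an IPv4 source address cannot match a list that contains only IPv6 entries, regardless of which interface the connection came from.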
Are your interfaces correctly set up on both nodes? Are the routing tables correct? You can try disabling IPv6 and then limiting the traffic to the usual network (I am not sure what you mean by the control network in your setup).
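As a rough sketch of that suggestion, the environment for the job could be built like this in Python. Several things here are assumptions: the helper name `ompi_tcp_env` is hypothetical, the /24 prefix for the 10.3.29.x network is a guess (use the real prefix length of your network), and `btl_tcp_disable_family` is, as far as I know, the TCP BTL parameter for disabling an address family; please confirm it exists in your build with `ompi_info --param btl tcp --level 9` before relying on it.

```python
import os

def ompi_tcp_env(subnet, disable_ipv6=True, base=None):
    """Return a copy of the environment with Open MPI TCP BTL settings added."""
    env = dict(os.environ if base is None else base)
    # Restrict the TCP BTL to one network (interface names or CIDR notation).
    env["OMPI_MCA_btl_tcp_if_include"] = subnet
    if disable_ipv6:
        # Assumed parameter: 6 asks the TCP BTL to skip IPv6 addresses.
        env["OMPI_MCA_btl_tcp_disable_family"] = "6"
    return env

# Assumed subnet; start from an empty base so only the added keys are shown.
env = ompi_tcp_env("10.3.29.0/24", base={})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching srun, or the same two variables can simply be exported in the job script.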
I have found a couple of issues that seem similar to this one, but I can't tell whether they have been solved or how they apply to my situation.
I am running Slurm with srun using Open MPI. When I run a job on only one node it completes (with some warnings), but when I run it on two nodes I get an error.
I investigated and it seems node 11 cannot communicate with node 12. One thing that bugs me is that I don't know what
a03:190d::a03:190d::, a03:1871::a03:1871::, a03:1d53::a03:1d53::, a03:1d0d::a03:1d0d::
are (yes, IPv6), since: 1) similar errors reported on the internet usually list alternative IPv4 addresses here, and 2) these IPv6 addresses can't be found anywhere when I run ip addr.
I investigated further: 10.3.29.53 is Node 12's 25G RoCEv2 Network interface, and 10.3.29.82 (the one in the verbose log above) is Node 11's 25G RoCEv2 Control Network interface.
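For reference, the kind of lookup described above (searching the ip addr output for a given address) can be scripted. The sketch below assumes an iproute2 version that supports JSON output via `ip -j addr`; the function name `local_addresses` is hypothetical, and the sample JSON is made up to show the expected shape, not taken from either node.

```python
import json

def local_addresses(ip_json_text):
    """Collect every address listed in the output of `ip -j addr`."""
    found = set()
    for iface in json.loads(ip_json_text):
        for info in iface.get("addr_info", []):
            if "local" in info:
                found.add(info["local"])
    return found

# Made-up sample in the shape `ip -j addr` prints (documentation address range).
sample = (
    '[{"ifname": "eth0", "addr_info": '
    '[{"family": "inet", "local": "192.0.2.10", "prefixlen": 24}]}]'
)
addrs = local_addresses(sample)
print("192.0.2.10" in addrs)  # True
```

On a real node the input would come from `subprocess.check_output(["ip", "-j", "addr"], text=True)`, and each address from the error message can then be tested for membership in the returned set.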
Another thing that confuses me: the log says Node 12 is attempting to connect to Node 11's RoCEv2 control network, but the error seems to say the opposite, that Node 11 is trying to connect to Node 12 on an unexpected IP.
I have tried limiting OMPI_MCA_btl_tcp_if_include to several values, but the error disappeared only once, and the process then got stuck. I am at a loss as to how to proceed.