Open wuxun-zhang opened 4 years ago
Can you please post your mpirun command line?
Note that you need to restrict the port range for both oob/tcp and btl/tcp.
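As a rough sketch of what restricting both components to one range looks like, the helper below builds the three MCA arguments from a first and last port. The parameter names are the ones used in the command line posted in this thread. The function name port_range_mca_args is made up for illustration, and it assumes that btl_tcp_port_range_v4 is a count of ports starting at btl_tcp_port_min_v4 (which matches the 8445 / 11 pair used for 8445-8455 in this thread).

```python
def port_range_mca_args(first_port, last_port):
    """Build mpirun MCA arguments that pin oob/tcp and btl/tcp to one port range."""
    if not (0 < first_port <= last_port <= 65535):
        raise ValueError("expected 0 < first_port <= last_port <= 65535")
    # Assumption: the btl range parameter is the number of ports, not the last port.
    count = last_port - first_port + 1
    return [
        "-mca", "oob_tcp_dynamic_ipv4_ports", f"{first_port}-{last_port}",
        "-mca", "btl_tcp_port_min_v4", str(first_port),
        "-mca", "btl_tcp_port_range_v4", str(count),
    ]


# Print the arguments for the range discussed in this thread.
print(" ".join(port_range_mca_args(8445, 8455)))
```

Generating all three values from the same two numbers keeps the oob and btl ranges from drifting apart when the range is changed later.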
@ggouaillardet Thanks for your quick reply. The command line is here. Did I miss something?
/opt/ubuntu/openmpi4.0/bin/mpirun \
-np ${num_proc} \
--hostfile ${hostfile} \
--bind-to socket \
--npersocket 1 \
--report-bindings \
-x KMP_AFFINITY=verbose,granularity=fine,noduplicates,compact,1,0 \
-x OMP_NUM_THREADS=${omp_threads} \
-mca oob_tcp_dynamic_ipv4_ports 8445-8455 \
-mca btl_tcp_port_min_v4 8445 \
-mca btl_tcp_port_range_v4 11 \
-mca oob_base_verbose 000 \
-mca pml ob1 \
-mca btl_base_verbose 000 \
-mca btl ^openib \
${workspace}/${single_script} ${network} ${node_count} ${num_proc}
Also, here is the output of ifconfig:
ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 172.29.133.44 netmask 255.255.255.224 broadcast 172.29.133.63
inet6 fe80::10db:a6ff:fee2:6568 prefixlen 64 scopeid 0x20<link>
ether 12:db:a6:e2:65:68 txqueuelen 1000 (Ethernet)
RX packets 110404 bytes 387516911 (387.5 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 82480 bytes 30599989 (30.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1729 bytes 211492 (211.4 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1729 bytes 211492 (211.4 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The mpirun command line looks fine, so this could be a transient issue.
With netstat -anp you can check which ports are available. Some of them might be in the FIN_WAIT state, in which case you can simply wait until the OS releases them (that might take up to 5 minutes).
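A quick way to cross-check the netstat output is to try binding each port in the range, as sketched below. This is only an approximation and not something Open MPI provides: the function name free_ports is hypothetical, and the result reflects only what the local kernel lets the current user bind at that moment. It says nothing about firewall rules or about the remote node.

```python
import socket


def free_ports(first_port, last_port, host="0.0.0.0"):
    """Return the ports in [first_port, last_port] that can be bound right now."""
    free = []
    for port in range(first_port, last_port + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # No SO_REUSEADDR on purpose: a port still held by another
                # socket should be reported as busy.
                sock.bind((host, port))
            except OSError:
                continue
            free.append(port)
    return free


if __name__ == "__main__":
    # Range taken from the command line in this thread.
    print(free_ports(8445, 8455))
```

Running it on every node before launching mpirun shows whether a leftover process is still holding part of the range.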
I tried many times, but it fails every time. I just checked the output of netstat -anp while running mpirun. How can I figure out which ports are available now?
(base) ubuntu@ip-172-29-133-44:/opt/ubuntu/Multinode-training/dist_scripts$ netstat -anp
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8445 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8446 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8447 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8448 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8449 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8450 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8451 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8452 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8453 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8454 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8455 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 127.0.0.1:46567 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:46567 127.0.0.1:55012 ESTABLISHED 4438/mpirun
tcp 0 0 172.29.133.44:22 192.102.204.37:64626 ESTABLISHED -
tcp 0 360 172.29.133.44:22 192.102.204.37:1695 ESTABLISHED -
tcp 0 0 172.29.133.44:43670 172.29.133.46:2049 ESTABLISHED -
tcp 0 0 172.29.133.44:49742 172.29.133.60:22 ESTABLISHED 4442/ssh
tcp 0 0 127.0.0.1:55012 127.0.0.1:46567 ESTABLISHED 4457/python
tcp 0 0 127.0.0.1:55010 127.0.0.1:46567 ESTABLISHED 4458/python
tcp 0 0 172.29.133.44:8445 172.29.133.60:43626 ESTABLISHED 4438/mpirun
tcp 0 0 172.29.133.44:41510 52.46.128.123:443 ESTABLISHED -
tcp 0 0 127.0.0.1:46567 127.0.0.1:55010 ESTABLISHED 4438/mpirun
tcp 0 0 172.29.133.44:55396 52.94.228.178:443 ESTABLISHED -
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
udp 0 0 127.0.0.53:53 0.0.0.0:* -
udp 0 0 172.29.133.44:68 0.0.0.0:* -
udp 0 0 0.0.0.0:111 0.0.0.0:* -
udp 0 0 0.0.0.0:658 0.0.0.0:* -
udp6 0 0 :::111 :::* -
udp6 0 0 :::658 :::* -
raw6 0 0 :::58 :::* 7 -
@ggouaillardet Is this issue related to the firewall or iptables configuration? I just added rich rules for these two IP addresses to get past the firewall, like sudo firewall-cmd --add-rich-rule='rule family="ipv4" source address="172.29.133.44" accept'
and sudo firewall-cmd --add-rich-rule='rule family="ipv4" source address="172.29.133.60
on both nodes.
If you grep LISTEN you can see that mpirun is using ports 8445 to 8455 (and this looks fishy...).
You should also do that on the remote node(s), kill the offending processes, and wait before trying again.
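To make the grep LISTEN step easier to repeat on each node, here is a small sketch that groups listening TCP ports by owning process from netstat -anp text. It assumes the column layout shown in the output posted in this thread (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name). The function name listening_ports is made up for illustration, and the sample lines are copied from that output.

```python
from collections import defaultdict


def listening_ports(netstat_text):
    """Map 'PID/Program name' to the sorted TCP ports it holds in LISTEN state."""
    owners = defaultdict(set)
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, UDP/raw sockets, and anything that is not listening.
        if len(fields) < 7 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue
        # The port is whatever follows the last ':' of the local address.
        port = int(fields[3].rsplit(":", 1)[1])
        owners[fields[6]].add(port)
    return {owner: sorted(ports) for owner, ports in owners.items()}


sample = """\
tcp 0 0 0.0.0.0:8445 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:8446 0.0.0.0:* LISTEN 4438/mpirun
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 172.29.133.44:8445 172.29.133.60:43626 ESTABLISHED 4438/mpirun
"""
print(listening_ports(sample))
```

An owner shown as "-" means netstat could not identify the process, which is why running it as root on each node gives a more complete picture.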
It may be helpful to specify exact interfaces to use, for example: mpirun --mca btl_tcp_if_include eth1,eth2 or mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16 as indicated at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
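For the CIDR form of btl_tcp_if_include, the subnet can be derived from the inet address and netmask that ifconfig reports. The sketch below uses the ens5 values posted earlier in this thread. The helper name subnet_for is hypothetical, and whether restricting the interface actually resolves this particular failure is not established here.

```python
import ipaddress


def subnet_for(address, netmask):
    """Return the CIDR subnet (as a string) that contains address under netmask."""
    return str(ipaddress.ip_interface(f"{address}/{netmask}").network)


# Values taken from the ens5 entry in the ifconfig output above.
subnet = subnet_for("172.29.133.44", "255.255.255.224")
print(f"--mca btl_tcp_if_include {subnet}")

# Check that the other node's address falls inside the same subnet.
print(ipaddress.ip_address("172.29.133.60") in ipaddress.ip_network(subnet))
```

If the second check is false for any node, that node would not be reachable over the interface selected by the subnet.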
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
OpenMPI 4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Build from source
Please describe the system on which you are running
Details of the problem
Since there is a firewall on my server, I opened ports 8445-8450 for MPI communication. But when I try to launch 4 processes on 2 nodes (2 processes per node), it throws an error like the one below.
In the end: