open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

[btl_tcp_component.c:966:mca_btl_tcp_component_create_listen] bind() failed: no port available in the range #7130

wuxun-zhang opened this issue 4 years ago (Open)

wuxun-zhang commented 4 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OpenMPI 4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Build from source

Please describe the system on which you are running


Details of the problem

Since there is a firewall on my server, I opened ports 8445-8450 for MPI communication. But when I try to launch 4 processes on 2 nodes (2 processes per node), it throws an error like the one below.

[ip-172-29-133-60:04262] oob:tcp:send_handler SENDING MSG
[ip-172-29-133-60:04272] mca: base: components_register: registering framework btl components
[ip-172-29-133-60:04272] mca: base: components_register: found loaded component tcp
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8445
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8446
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8447
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8448
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8449
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8450
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8451
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8452
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8453
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8454
[ip-172-29-133-44:04059] btl:tcp: Attempting to bind to AF_INET port 8455
[ip-172-29-133-44][[9486,1],0][btl_tcp_component.c:966:mca_btl_tcp_component_create_listen] bind() failed: no port available in the range [8445..8456]
[ip-172-29-133-44:04059] select: init of component tcp returned failure
[ip-172-29-133-44:04059] mca: base: close: component tcp closed
[ip-172-29-133-44:04059] mca: base: close: unloading component tcp

In the end:

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[9486,1],2]) is on host: ip-172-29-133-60
  Process 2 ([[9486,1],0]) is on host: ip-172-29-133-44
  BTLs attempted: tcp vader self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[ip-172-29-133-44:04040] [[9486,0],0]:tcp:recv:handler called for peer [[9486,0],1]
[ip-172-29-133-44:04040] [[9486,0],0]:tcp:recv:handler CONNECTED
[ip-172-29-133-44:04040] [[9486,0],0]:tcp:recv:handler allocate new recv msg
[ip-172-29-133-60:04262] [[9486,0],1] OOB_SEND: rml_oob_send.c:265
[ip-172-29-133-60:04262] [[9486,0],1] oob:base:send to target [[9486,0],0] - attempt 0
[ip-172-29-133-60:04262] [[9486,0],1] oob:base:send known transport for peer [[9486,0],0]
[ip-172-29-133-60:04262] [[9486,0],1] oob:tcp:send_nb to peer [[9486,0],0]:2 seq = -1
[ip-172-29-133-60:04262] [[9486,0],1]:[oob_tcp.c:198] processing send to peer [[9486,0],0]:2 seq_num = -1 via [[9486,0],0]
[ip-172-29-133-44:04040] [[9486,0],0]:tcp:recv:handler read hdr
[ip-172-29-133-44:04040] [[9486,0],0]:tcp:recv:handler allocate data region of size 173
[ip-172-29-133-44:04040] [[9486,0],0] RECVD COMPLETE MESSAGE FROM [[9486,0],1] (ORIGIN [[9486,0],1]) OF 173 BYTES FOR DEST [[9486,0],0] TAG 2
[ip-172-29-133-60:04262] [[9486,0],1] tcp:send_nb: already connected to [[9486,0],0] - queueing for send
[ip-172-29-133-60:04262] [[9486,0],1]:[oob_tcp.c:208] queue send to [[9486,0],0]
[ip-172-29-133-60:04262] [[9486,0],1] tcp:send_handler called to send to peer [[9486,0],0]
[ip-172-29-133-44:04040] [[9486,0],0] DELIVERING TO RML tag = 2 seq_num = -1
Error in MPI_Isend(139756493935584, 1, 0x7f1b9e373e80, 0, -27, 139756595289664) (-12)
Error in NBC_Start_round() (-12)
Error in NBC_Start_round() (-1)
[ip-172-29-133-60:04262] [[9486,0],1] tcp:send_handler SENDING TO [[9486,0],0]
[ip-172-29-133-60:04262] oob:tcp:send_handler SENDING MSG
[ip-172-29-133-60:04262] [[9486,0],1] MESSAGE SEND COMPLETE TO [[9486,0],0] OF 173 BYTES ON SOCKET 17
ggouaillardet commented 4 years ago

Can you please post your mpirun command line? Note that you need to restrict the port range for both oob/tcp and btl/tcp.
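
For reference, the relevant MCA parameters look roughly like this (a sketch only; the actual range is whatever your firewall permits):

    # runtime (out-of-band) connections used by mpirun/orted
    --mca oob_tcp_dynamic_ipv4_ports 8445-8455
    # MPI point-to-point (BTL/TCP) connections
    --mca btl_tcp_port_min_v4 8445
    --mca btl_tcp_port_range_v4 11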

wuxun-zhang commented 4 years ago

@ggouaillardet Thanks for your quick reply. Here is the command line. Did I miss something?

/opt/ubuntu/openmpi4.0/bin/mpirun \
    -np ${num_proc} \
    --hostfile ${hostfile} \
    --bind-to socket \
    --npersocket 1 \
    --report-bindings \
    -x KMP_AFFINITY=verbose,granularity=fine,noduplicates,compact,1,0 \
    -x OMP_NUM_THREADS=${omp_threads} \
    -mca oob_tcp_dynamic_ipv4_ports 8445-8455 \
    -mca btl_tcp_port_min_v4 8445 \
    -mca btl_tcp_port_range_v4 11 \
    -mca oob_base_verbose 000 \
    -mca pml ob1 \
    -mca btl_base_verbose 000 \
    -mca btl ^openib \
    ${workspace}/${single_script} ${network} ${node_count} ${num_proc}

I've also put the output of ifconfig here:

ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 172.29.133.44  netmask 255.255.255.224  broadcast 172.29.133.63
        inet6 fe80::10db:a6ff:fee2:6568  prefixlen 64  scopeid 0x20<link>
        ether 12:db:a6:e2:65:68  txqueuelen 1000  (Ethernet)
        RX packets 110404  bytes 387516911 (387.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 82480  bytes 30599989 (30.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1729  bytes 211492 (211.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1729  bytes 211492 (211.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
ggouaillardet commented 4 years ago

The mpirun command line looks fine, so this could be a transient issue.

With netstat -anp you can check which ports are available. Some of them might be in the FIN_WAIT state, in which case you can simply wait until they are released by the OS (that might take up to 5 minutes).
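
For example, something along these lines lists any sockets still attached to the 8445-8455 range, including ones in FIN_WAIT/TIME_WAIT:

    netstat -anp | grep -E ':(844[5-9]|845[0-5])\b'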

wuxun-zhang commented 4 years ago

I tried many times but it fails every time. I just checked the output of netstat -anp while mpirun was running. How can I figure out which ports are available now?

(base) ubuntu@ip-172-29-133-44:/opt/ubuntu/Multinode-training/dist_scripts$ netstat -anp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:8445            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8446            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8447            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8448            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8449            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8450            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8451            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8452            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8453            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8454            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:8455            0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 127.0.0.1:46567         0.0.0.0:*               LISTEN      4438/mpirun
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:46567         127.0.0.1:55012         ESTABLISHED 4438/mpirun
tcp        0      0 172.29.133.44:22        192.102.204.37:64626    ESTABLISHED -
tcp        0    360 172.29.133.44:22        192.102.204.37:1695     ESTABLISHED -
tcp        0      0 172.29.133.44:43670     172.29.133.46:2049      ESTABLISHED -
tcp        0      0 172.29.133.44:49742     172.29.133.60:22        ESTABLISHED 4442/ssh
tcp        0      0 127.0.0.1:55012         127.0.0.1:46567         ESTABLISHED 4457/python
tcp        0      0 127.0.0.1:55010         127.0.0.1:46567         ESTABLISHED 4458/python
tcp        0      0 172.29.133.44:8445      172.29.133.60:43626     ESTABLISHED 4438/mpirun
tcp        0      0 172.29.133.44:41510     52.46.128.123:443       ESTABLISHED -
tcp        0      0 127.0.0.1:46567         127.0.0.1:55010         ESTABLISHED 4438/mpirun
tcp        0      0 172.29.133.44:55396     52.94.228.178:443       ESTABLISHED -
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
udp        0      0 127.0.0.53:53           0.0.0.0:*                           -
udp        0      0 172.29.133.44:68        0.0.0.0:*                           -
udp        0      0 0.0.0.0:111             0.0.0.0:*                           -
udp        0      0 0.0.0.0:658             0.0.0.0:*                           -
udp6       0      0 :::111                  :::*                                -
udp6       0      0 :::658                  :::*                                -
raw6       0      0 :::58                   :::*                    7           -
wuxun-zhang commented 4 years ago

@ggouaillardet Is this issue related to the firewall or iptables configuration? I just added rich rules for these two IP addresses to get through the firewall, i.e. sudo firewall-cmd --add-rich-rule='rule family="ipv4" source address="172.29.133.44" accept' and sudo firewall-cmd --add-rich-rule='rule family="ipv4" source address="172.29.133.60" accept' on both nodes.

ggouaillardet commented 4 years ago

If you grep LISTEN you can see that mpirun is using ports 8445 to 8455 (and that looks fishy...). You should also do that on the remote node(s), kill the offending processes, and wait before trying again.
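
Concretely, something like this on each node shows which process owns the listeners in that range (a sketch; adjust the port range to your setup):

    netstat -anp | grep LISTEN | grep -E ':(844[5-9]|845[0-5])\b'
    # if a leftover mpirun/orted from an earlier run is still holding them:
    pkill -u $USER mpirun
    pkill -u $USER orted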

bkmgit commented 3 years ago

It may be helpful to specify the exact interfaces to use, for example mpirun --mca btl_tcp_if_include eth1,eth2 or mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16, as indicated at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
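
A minimal sketch of what that could look like here, using the ens5 interface from the ifconfig output above (adjust the interface name to each node; oob_tcp_if_include additionally pins the runtime traffic to the same interface):

    mpirun -np ${num_proc} --hostfile ${hostfile} \
        --mca btl_tcp_if_include ens5 \
        --mca oob_tcp_if_include ens5 \
        ... (remaining options as in the command above)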