UCX ERROR try to increase "net.core.somaxconn", "net.core.netdev_max_backlog", "net.ipv4.tcp_max_syn_backlog" to the maximum value on the remote node or increase UCX_TCP_MAX_CONN_RETRIES (=25) #5126
OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
For RDMA/IB/RoCE related issues:
Driver version:
rpm -q rdma-core or rpm -q libibverbs
or: MLNX_OFED version ofed_info -s
HW information from ibstat or ibv_devinfo -vv command
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]# uname -r
3.10.0-957.el7.x86_64
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]# more /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]#
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.42.5000
Hardware version: 1
Node GUID: 0xe41d2d0300733ec0
System image GUID: 0xe41d2d0300733ec0
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xe61d2dfffe733ec0
Link layer: Ethernet
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0xe61d2dfffe733ec1
Link layer: Ethernet
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]#
For GPU related issues:
GPU type
Cuda:
Drivers version
Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
Additional information (depending on the issue)
OpenMPI version
Output of ucx_info -d to show transports and devices recognized by UCX
Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
log:
[1588535089.198081] [client12:28850:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.199714] [client12:28916:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28850] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.215761] [client12:28850:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[client12:28916] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.217371] [client12:28916:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[1588535089.218706] [client12:28861:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1588535085.905218] [client13:18734:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.920082] [client13:18815:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.920358] [client13:18815:0] mpool.c:43 UCX WARN object 0x7fece4887ee8 was not returned to mpool mm_recv_desc
[1588535089.239031] [client12:28958:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[client13:18734] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.244577] [client12:28820:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28820:0:28820] Caught signal 7 (Bus error: Sent by the kernel)
[1588535089.245677] [client12:28859:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1588535085.928352] [client13:18734:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[client12:28861] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.247507] [client12:28861:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[client13:18815] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.249375] [client12:28958:0] mpool.c:43 UCX WARN object 0x7f049fffaee8 was not returned to mpool mm_recv_desc
[1588535085.933806] [client13:18784:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.938354] [client13:18815:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.938354] [client13:18815:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[1588535089.258257] [client12:28845:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28958] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.261231] [client12:28856:0] mm_posix.c:195 UCX ERROR open(file_name=/proc/28861/fd/38 flags=0x0) failed: No such file or directory
[1588535089.261271] [client12:28856:0] mm_ep.c:149 UCX ERROR mm ep failed to connect to remote FIFO id 0xc0000009800070bd: Shared memory error
[LOG_CAT_P2P] UCX returned connect error: Shared memory error
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.261384] [client12:28958:0] sock.c:344 UCX ERROR recv(fd=45) failed: Bad address
[1588535089.261464] [client12:28859:0] mpool.c:43 UCX WARN object 0x7f1a1557eee8 was not returned to mpool mm_recv_desc
[1588535089.264691] [client12:28962:0] mm_posix.c:195 UCX ERROR open(file_name=/proc/28958/fd/38 flags=0x0) failed: No such file or directory
[1588535089.264723] [client12:28962:0] mm_ep.c:149 UCX ERROR mm ep failed to connect to remote FIFO id 0xc00000098000711e: Shared memory error
[LOG_CAT_P2P] UCX returned connect error: Shared memory error
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.948761] [client13:18807:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.267397] [client12:28942:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[1588535089.267948] [client12:28907:0] select.c:434 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
Describe the bug
mpirun fails when more than 360 tasks are scheduled across 10 clients.
Steps to Reproduce
Command line: mpirun -v --oversubscribe --allow-run-as-root -x LD_LIBRARY_PATH --hostfile mfile.1 -np $NP ./mdtest -F -n $NF -i 1 -d "$TESTDIR"
UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
Any UCX environment variables used
ucx_info -v
UCT version=1.8.0 revision d5d6218
configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem --with-xpmem=/hpc/local/oss/xpmem --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda10.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64/ucx
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]#
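The error in the title suggests raising three kernel settings on the remote node, or increasing UCX_TCP_MAX_CONN_RETRIES. As a starting point, here is a minimal sketch for checking what those settings currently are on a node. The function name show_backlog_settings is made up for illustration; it only reads the standard /proc/sys entries and changes nothing.

```shell
#!/bin/sh
# Print the current value of each kernel setting named in the UCX error message.
# Reads /proc/sys directly, so no root is needed; prints "unavailable" when a
# key is not exposed (for example inside some containers).
# show_backlog_settings is a hypothetical helper name, not part of UCX.
show_backlog_settings() {
    for key in net.core.somaxconn net.core.netdev_max_backlog net.ipv4.tcp_max_syn_backlog; do
        # Map the dotted sysctl name to its /proc/sys path.
        path="/proc/sys/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" "$(cat "$path")"
        else
            printf '%s = unavailable\n' "$key"
        fi
    done
}

show_backlog_settings
```

If the values turn out to be low, they could be raised as root with sysctl -w <key>=<value> on the remote node, and the retry count could be passed through mpirun with -x UCX_TCP_MAX_CONN_RETRIES=<n>. Suitable values are not known here, and it is an assumption that either change resolves the failure above 360 tasks.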