openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX ERROR try to increase "net.core.somaxconn", "net.core.netdev_max_backlog", "net.ipv4.tcp_max_syn_backlog" to the maximum value on the remote node or increase UCX_TCP_MAX_CONN_RETRIES (=25) #5126

Open aasraoui opened 4 years ago

aasraoui commented 4 years ago

Describe the bug

mpirun fails when more than 360 tasks are scheduled on 10 clients.
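The error in the issue title names three kernel settings. As a minimal sketch, assuming a Linux node with `/proc/sys` mounted, the following reads their current values so they can be compared across nodes. The helper names `sysctl_path` and `read_sysctls` are made up for illustration and are not part of UCX.

```python
from pathlib import Path

# Kernel settings named in the UCX error message in the issue title.
SYSCTLS = (
    "net.core.somaxconn",
    "net.core.netdev_max_backlog",
    "net.ipv4.tcp_max_syn_backlog",
)


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under root (default /proc/sys)."""
    return Path(root).joinpath(*name.split("."))


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    """Return {name: value}; the value is None if the file cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            # Missing file or insufficient permissions: report as unknown.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        print(f"{name} = {value}")
```

Running this on each node only reports the values; whether raising them helps depends on the error from the title actually occurring, which the log in this report does not show.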

Steps to Reproduce

Setup and versions

[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.42.5000
    Hardware version: 1
    Node GUID: 0xe41d2d0300733ec0
    System image GUID: 0xe41d2d0300733ec0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffe733ec0
        Link layer: Ethernet
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffe733ec1
        Link layer: Ethernet
[root@client4 hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.5-1.0.1.0-redhat7.6-x86_64]#

Additional information (depending on the issue)

log:

[1588535089.198081] [client12:28850:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.199714] [client12:28916:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28850] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.215761] [client12:28850:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[client12:28916] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.217371] [client12:28916:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[1588535089.218706] [client12:28861:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1588535085.905218] [client13:18734:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.920082] [client13:18815:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.920358] [client13:18815:0]          mpool.c:43   UCX  WARN  object 0x7fece4887ee8 was not returned to mpool mm_recv_desc
[1588535089.239031] [client12:28958:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[client13:18734] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.244577] [client12:28820:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28820:0:28820] Caught signal 7 (Bus error: Sent by the kernel)
[1588535089.245677] [client12:28859:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1588535085.928352] [client13:18734:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[client12:28861] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[1588535089.247507] [client12:28861:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[client13:18815] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.249375] [client12:28958:0]          mpool.c:43   UCX  WARN  object 0x7f049fffaee8 was not returned to mpool mm_recv_desc
[1588535085.933806] [client13:18784:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.938354] [client13:18815:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[1588535089.258257] [client12:28845:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[client12:28958] Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.261231] [client12:28856:0]       mm_posix.c:195  UCX  ERROR open(file_name=/proc/28861/fd/38 flags=0x0) failed: No such file or directory
[1588535089.261271] [client12:28856:0]          mm_ep.c:149  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc0000009800070bd: Shared memory error
[LOG_CAT_P2P] UCX returned connect error: Shared memory error
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.261384] [client12:28958:0]           sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
[1588535089.261464] [client12:28859:0]          mpool.c:43   UCX  WARN  object 0x7f1a1557eee8 was not returned to mpool mm_recv_desc
[1588535089.264691] [client12:28962:0]       mm_posix.c:195  UCX  ERROR open(file_name=/proc/28958/fd/38 flags=0x0) failed: No such file or directory
[1588535089.264723] [client12:28962:0]          mm_ep.c:149  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000098000711e: Shared memory error
[LOG_CAT_P2P] UCX returned connect error: Shared memory error
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535085.948761] [client13:18807:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[LOG_CAT_P2P] UCX returned connect error: Destination is unreachable
[LOG_CAT_P2P] hmca_bcol_ucx_p2p_preconnect() failed
[1588535089.267397] [client12:28942:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rc_verbs/mlx4_0:1 - Destination is unreachable, ud_verbs/mlx4_0:1 - Destination is unreachable
[1588535089.267948] [client12:28907:0]         select.c:434  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation

dmitrygx commented 4 years ago

@aasraoui I see a lot of errors from UCX's shared-memory transport, but I don't see the error from the issue title in this log. Could you check, please?
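One way to check which UCX errors a log actually contains is to tally them by source location. The following is a minimal sketch that assumes the line layout seen in the log above (`[timestamp] [host:pid:tid] file.c:line UCX LEVEL message`); the function names `tally` and `mentions` are made up for illustration and are not UCX tooling.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
# [1588535089.215761] [client12:28850:0]  sock.c:344  UCX  ERROR recv(fd=45) failed: Bad address
# The leading "[" is optional because some pasted lines have lost it.
UCX_LINE = re.compile(
    r"\[?(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)"
)


def tally(lines):
    """Count UCX log lines per (level, source file:line)."""
    counts = Counter()
    for line in lines:
        # Drop carriage-return debris, either literal "^M" or a real "\r".
        cleaned = line.replace("^M", "").rstrip("\r\n")
        match = UCX_LINE.search(cleaned)
        if match:
            counts[(match["level"], match["src"])] += 1
    return counts


def mentions(lines, needle):
    """Return True if any line contains the given text."""
    return any(needle in line for line in lines)


if __name__ == "__main__":
    log_lines = sys.stdin.readlines()
    for (level, src), count in tally(log_lines).most_common():
        print(f"{count:5d}  {level:5s}  {src}")
    print("title error present:", mentions(log_lines, "UCX_TCP_MAX_CONN_RETRIES"))
```

Piping the full job output into the script shows at a glance whether the message quoted in the title appears alongside the shared-memory and socket errors.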

dmitrygx commented 3 years ago

@aasraoui ping