hang here
I see that the process is busy all the time from top:
12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf 12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf
I am able to ssh to localhost.
I found that when i remove the localhost ip from the hostfile(replace it with other device's ip), the problem will not appear. but if i add the localhost's ip in the hostfile, the mpirun nccl-test will not work, will be hanging, and the cpu is high; I have tried many solutions to position the problem as followings:
first, i use mpirun command to give a simple test:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 hostname
it is successful, and show as following:
10.xx.3.239 10.x.1.147
I run the command:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca oob_tcp_if_include eth0 all_reduce_perf -b 8 -e 8G -g 1 -f 2
but i also failed, it is hanging without any print.
Because there is no debug info, so i guess the problem is happened before the all_reduce_perf, the i tried to config more debug info for mpirun, as the command:
mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8G -g 1 -f 2
the debug info is:
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:5 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:5 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 45 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/4
[10-43-1-147:1347858] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347858] tmp: /tmp
[10-43-1-147:1347859] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/5
[10-43-1-147:1347859] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347859] tmp: /tmp
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347858] mca: base: components_register: registering framework btl components
[10-43-1-147:1347858] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 120
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 120 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347859] mca: base: components_register: registering framework btl components
[10-43-1-147:1347859] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 395
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 395 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_register: component self register function successful
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 120 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 395 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 493 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 493
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 493 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_open: opening btl components
[10-43-1-147:1347858] mca: base: components_open: found loaded component self
[10-43-1-147:1347858] mca: base: components_open: component self open function successful
[10-43-1-147:1347858] select: initializing btl component self
[10-43-1-147:1347858] select: init of component self returned success
[10-43-1-147:1347859] mca: base: components_register: component self register function successful
[10-43-1-147:1347859] mca: base: components_open: opening btl components
[10-43-1-147:1347859] mca: base: components_open: found loaded component self
[10-43-1-147:1347859] mca: base: components_open: component self open function successful
[10-43-1-147:1347859] select: initializing btl component self
[10-43-1-147:1347859] select: init of component self returned success
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 5504
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 5504 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 5504 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 1827 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 1827
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 1827 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 29
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 29 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 80 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 29 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 80
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 80 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
thus I check the mpi code, I found the condition cause the hanging is in the orterun.c :
while (orte_event_base_active && completest.active) { //orte_event_base_active :alway true, not false
opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
}
ORTE_ACQUIRE_OBJECT(orte_event_base_active);
the variable orte_event_base_active is always true, haven't be changed to false. But I couldn't understand why the question happened.
At last, my env list is :
cuda: 12.3
mpirun: 4.1.7a1
nccl: 2.20.3
I compiled with MPI=1, but it hangs when i run the following:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
hang here I see that the process is busy all the time from top:
12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf 12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf
it does not hang if I run
I am able to ssh to localhost. I found that when i remove the localhost ip from the hostfile(replace it with other device's ip), the problem will not appear. but if i add the localhost's ip in the hostfile, the mpirun nccl-test will not work, will be hanging, and the cpu is high; I have tried many solutions to position the problem as followings:
first, i use mpirun command to give a simple test:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 hostname
it is successful, and show as following:10.xx.3.239 10.x.1.147
I run the command:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca oob_tcp_if_include eth0 all_reduce_perf -b 8 -e 8G -g 1 -f 2
but i also failed, it is hanging without any print.Because there is no debug info, so i guess the problem is happened before the all_reduce_perf, the i tried to config more debug info for mpirun, as the command:
mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8G -g 1 -f 2
the debug info is:thus I check the mpi code, I found the condition cause the hanging is in the orterun.c :
the variable orte_event_base_active is always true, haven't be changed to false. But I couldn't understand why the question happened.
At last, my env list is : cuda: 12.3 mpirun: 4.1.7a1 nccl: 2.20.3
please help me to resolve this question.