open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.07k stars 844 forks source link

mpirun nccl-test hang #12618

Open 913871734 opened 3 weeks ago

913871734 commented 3 weeks ago

I compiled with MPI=1, but it hangs when i run the following:

ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

hang here I see that the process is busy all the time from top: 12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf 12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf

it does not hang if I run

ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
 nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1

 Using devices
   Rank  0 Pid  12275 on ip-172-32-45-72 device  0 [0x00] Tesla V100-SXM2-32GB

                                                     out-of-place                       in-place
       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum     4.43    0.00    0.00  0e+00     0.31    0.03    0.00  0e+00
          16             4   float     sum     4.34    0.00    0.00  0e+00     0.31    0.05    0.00  0e+00
          32             8   float     sum     4.30    0.01    0.00  0e+00     0.30    0.11    0.00  0e+00
          64            16   float     sum     4.36    0.01    0.00  0e+00     0.30    0.21    0.00  0e+00
         128            32   float     sum     4.27    0.03    0.00  0e+00     0.30    0.42    0.00  0e+00
         256            64   float     sum     4.26    0.06    0.00  0e+00     0.30    0.85    0.00  0e+00
         512           128   float     sum     4.36    0.12    0.00  0e+00     0.30    1.70    0.00  0e+00
        1024           256   float     sum     4.33    0.24    0.00  0e+00     0.30    3.38    0.00  0e+00
        2048           512   float     sum     4.32    0.47    0.00  0e+00     0.31    6.66    0.00  0e+00
        4096          1024   float     sum     4.27    0.96    0.00  0e+00     0.30   13.47    0.00  0e+00
        8192          2048   float     sum     4.36    1.88    0.00  0e+00     0.30   26.96    0.00  0e+00
       16384          4096   float     sum     4.30    3.81    0.00  0e+00     0.30   53.84    0.00  0e+00
       32768          8192   float     sum     4.27    7.67    0.00  0e+00     0.30  108.23    0.00  0e+00
       65536         16384   float     sum     4.36   15.04    0.00  0e+00     0.30  215.47    0.00  0e+00
      131072         32768   float     sum     4.35   30.13    0.00  0e+00     0.30  430.24    0.00  0e+00
      262144         65536   float     sum     4.37   59.96    0.00  0e+00     0.30  864.73    0.00  0e+00
      524288        131072   float     sum     4.34  120.92    0.00  0e+00     0.30  1728.04    0.00  0e+00
     1048576        262144   float     sum     6.10  171.98    0.00  0e+00     0.30  3444.73    0.00  0e+00
     2097152        524288   float     sum     8.78  238.96    0.00  0e+00     0.31  6866.90    0.00  0e+00
     4194304       1048576   float     sum    14.64  286.48    0.00  0e+00     0.30  13751.82    0.00  0e+00
     8388608       2097152   float     sum    25.98  322.90    0.00  0e+00     0.30  27786.05    0.00  0e+00
    16777216       4194304   float     sum    47.48  353.35    0.00  0e+00     0.30  55942.70    0.00  0e+00
    33554432       8388608   float     sum    90.58  370.45    0.00  0e+00     0.30  111107.39    0.00  0e+00
    67108864      16777216   float     sum    176.7  379.73    0.00  0e+00     0.30  221517.95    0.00  0e+00
   134217728      33554432   float     sum    349.2  384.30    0.00  0e+00     0.30  443841.69    0.00  0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0

I am able to ssh to localhost. I found that when i remove the localhost ip from the hostfile(replace it with other device's ip), the problem will not appear. but if i add the localhost's ip in the hostfile, the mpirun nccl-test will not work, will be hanging, and the cpu is high; I have tried many solutions to position the problem as followings:

  1. first, i use mpirun command to give a simple test: mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 hostname it is successful, and show as following: 10.xx.3.239 10.x.1.147

  2. I run the command: mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca oob_tcp_if_include eth0 all_reduce_perf -b 8 -e 8G -g 1 -f 2 but i also failed, it is hanging without any print.

  3. Because there is no debug info, so i guess the problem is happened before the all_reduce_perf, the i tried to config more debug info for mpirun, as the command: mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8G -g 1 -f 2 the debug info is:

    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:5 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:5 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 45 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [10-43-1-147:1347858] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/4
    [10-43-1-147:1347858] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
    [10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0/jf.43885
    [10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0
    [10-43-1-147:1347858] tmp: /tmp
    [10-43-1-147:1347859] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/5
    [10-43-1-147:1347859] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
    [10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0/jf.43885
    [10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0
    [10-43-1-147:1347859] tmp: /tmp
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
    [10-43-1-147:1347858] mca: base: components_register: registering framework btl components
    [10-43-1-147:1347858] mca: base: components_register: found loaded component self
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 120
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 120 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [10-43-1-147:1347859] mca: base: components_register: registering framework btl components
    [10-43-1-147:1347859] mca: base: components_register: found loaded component self
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 395
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 395 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [10-43-1-147:1347858] mca: base: components_register: component self register function successful
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 120 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 395 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 493 BYTES ON SOCKET 18
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 493
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 493 BYTES FOR DEST [[43885,0],0] TAG 2
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
    [10-43-1-147:1347858] mca: base: components_open: opening btl components
    [10-43-1-147:1347858] mca: base: components_open: found loaded component self
    [10-43-1-147:1347858] mca: base: components_open: component self open function successful
    [10-43-1-147:1347858] select: initializing btl component self
    [10-43-1-147:1347858] select: init of component self returned success
    [10-43-1-147:1347859] mca: base: components_register: component self register function successful
    [10-43-1-147:1347859] mca: base: components_open: opening btl components
    [10-43-1-147:1347859] mca: base: components_open: found loaded component self
    [10-43-1-147:1347859] mca: base: components_open: component self open function successful
    [10-43-1-147:1347859] select: initializing btl component self
    [10-43-1-147:1347859] select: init of component self returned success
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 5504
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 5504 BYTES FOR DEST [[43885,0],0] TAG 33
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 5504 BYTES ON SOCKET 18
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 1827 BYTES ON SOCKET 23
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 1827
    [10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 1827 BYTES FOR DEST [[43885,0],1] TAG 15
    [10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 29
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 29 BYTES FOR DEST [[43885,0],0] TAG 33
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
    [dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 80 BYTES ON SOCKET 23
    [10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
    [10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
    [10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
    [10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
    [10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 29 BYTES ON SOCKET 18
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
    [10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 80
    [10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 80 BYTES FOR DEST [[43885,0],1] TAG 15
    [10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1

thus I check the mpi code, I found the condition cause the hanging is in the orterun.c :

    while (orte_event_base_active && completest.active) {          //orte_event_base_active :alway true, not false
        opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
    }
    ORTE_ACQUIRE_OBJECT(orte_event_base_active);

the variable orte_event_base_active is always true, haven't be changed to false. But I couldn't understand why the question happened.

At last, my env list is : cuda: 12.3 mpirun: 4.1.7a1 nccl: 2.20.3

please help me to resolve this question.

913871734 commented 3 weeks ago

the link of same issue I have submit to nccl group is: https://github.com/NVIDIA/nccl-tests/issues/26