pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

all_reduce misaligned address with bfloat16 #119345

Open alexisVallet opened 4 months ago

alexisVallet commented 4 months ago

🐛 Describe the bug

Hi! I am encountering the following error when using torch.distributed.all_reduce on bfloat16 tensors of a certain size using NCCL: RuntimeError: CUDA error: misaligned address.

I can only reproduce this with a large enough number of GPUs: in my environment it occurs with 8 GPUs but not with 2, for instance. The issue seems environment-specific, as I could not reproduce it everywhere I tried, though I can reproduce it on multiple nodes of the same cluster. I believe this is likely an NCCL bug, but I could not reproduce it consistently with nccl-tests, only with PyTorch. EDIT: after further testing, I could also reproduce this consistently with nccl-tests.

Some more potentially relevant environment information that collect_env.py didn't pick up:

I also tried various driver versions, CUDA versions, and PyTorch versions, but the error always occurs with some bfloat16 tensor size (though not necessarily the one in this example).
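Since the failure depends on tensor size and GPU count, it may help to look at the plain size arithmetic for the tensor used in the example further down (23528522 bfloat16 elements, 2 bytes each). The sketch below is only that arithmetic: `alignment_report` is a hypothetical helper, and the 16-byte boundary is an assumption (it is the widest CUDA vector load/store width), not something this report establishes as the violated boundary.

```python
# Sketch only: size arithmetic for the all_reduce buffer in this report.
# ALIGNMENT = 16 is an ASSUMPTION (widest CUDA vector access is 16 bytes);
# the report does not say which boundary is actually violated.

BF16_ITEMSIZE = 2  # bytes per bfloat16 element
ALIGNMENT = 16     # assumed alignment boundary, in bytes


def alignment_report(numel, world_size, itemsize=BF16_ITEMSIZE, alignment=ALIGNMENT):
    """Hypothetical helper: basic size facts for a 1-D all_reduce buffer."""
    total_bytes = numel * itemsize
    return {
        # Total buffer size in bytes.
        "total_bytes": total_bytes,
        # Non-zero means the buffer length is not a multiple of the boundary.
        "bytes_mod_alignment": total_bytes % alignment,
        # Non-zero means the elements cannot be split evenly across ranks.
        "numel_mod_world_size": numel % world_size,
    }


# Size and GPU count taken from the reproduction in this report.
report = alignment_report(numel=23528522, world_size=8)
print(report)
```

For this size the buffer is 47057044 bytes, which is not a multiple of 16, and the element count does not divide evenly across 8 ranks but does across 2. That is consistent with the 8-GPU versus 2-GPU observation above, but it is not a diagnosis. How NCCL actually partitions the buffer is not covered by this sketch.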

Minimal example to reproduce in my environment:

import torch
from torch.distributed import init_process_group, get_rank, all_reduce

def main():
    init_process_group(backend="nccl")
    rank = get_rank()
    device = torch.device(f"cuda:{rank}")
    test_tensor = torch.ones([23528522], dtype=torch.bfloat16, device=device)
    all_reduce(test_tensor)
    if rank == 0:
        print(f"{test_tensor=}")

if __name__ == "__main__":
    main()

Running it with:

NCCL_COMM_BLOCKING=1 CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node 8 -- main.py

Results in this log:

$ NCCL_COMM_BLOCKING=1 CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node 8 -- main.py
[2024-02-07 11:08:58,469] torch.distributed.run: [WARNING] 
[2024-02-07 11:08:58,469] torch.distributed.run: [WARNING] *****************************************
[2024-02-07 11:08:58,469] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-07 11:08:58,469] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [ik1-02]:54237 (errno: 22 - Invalid argument).
ik1-02:118796:118796 [0] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118796:118796 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118796:118796 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
ik1-02:118796:118796 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118798:118798 [2] NCCL INFO cudaDriverVersion 12030
ik1-02:118798:118798 [2] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118798:118798 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118798:118798 [2] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118801:118801 [5] NCCL INFO cudaDriverVersion 12030
ik1-02:118801:118801 [5] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118801:118801 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118801:118801 [5] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118802:118802 [6] NCCL INFO cudaDriverVersion 12030
ik1-02:118802:118802 [6] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118802:118802 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118802:118802 [6] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118799:118799 [3] NCCL INFO cudaDriverVersion 12030
ik1-02:118799:118799 [3] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118799:118799 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118799:118799 [3] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118800:118800 [4] NCCL INFO cudaDriverVersion 12030
ik1-02:118800:118800 [4] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118800:118800 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118800:118800 [4] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118797:118797 [1] NCCL INFO cudaDriverVersion 12030
ik1-02:118797:118797 [1] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118797:118797 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118797:118797 [1] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118803:118803 [7] NCCL INFO cudaDriverVersion 12030
ik1-02:118803:118803 [7] NCCL INFO Bootstrap : Using p1p0:192.168.1.3<0>
ik1-02:118803:118803 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ik1-02:118803:118803 [7] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ik1-02:118796:118890 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118796:118890 [0] NCCL INFO Using non-device net plugin version 0
ik1-02:118796:118890 [0] NCCL INFO Using network IB
ik1-02:118798:118891 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118798:118891 [2] NCCL INFO Using non-device net plugin version 0
ik1-02:118798:118891 [2] NCCL INFO Using network IB
ik1-02:118801:118893 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118801:118893 [5] NCCL INFO Using non-device net plugin version 0
ik1-02:118801:118893 [5] NCCL INFO Using network IB
ik1-02:118799:118895 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118799:118895 [3] NCCL INFO Using non-device net plugin version 0
ik1-02:118799:118895 [3] NCCL INFO Using network IB
ik1-02:118800:118896 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118800:118896 [4] NCCL INFO Using non-device net plugin version 0
ik1-02:118800:118896 [4] NCCL INFO Using network IB
ik1-02:118797:118897 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118797:118897 [1] NCCL INFO Using non-device net plugin version 0
ik1-02:118797:118897 [1] NCCL INFO Using network IB
ik1-02:118803:118898 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118803:118898 [7] NCCL INFO Using non-device net plugin version 0
ik1-02:118803:118898 [7] NCCL INFO Using network IB
ik1-02:118802:118894 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_6:1/RoCE [3]mlx5_8:1/RoCE [4]mlx5_bond_0:1/RoCE [RO]; OOB p1p0:192.168.1.3<0>
ik1-02:118802:118894 [6] NCCL INFO Using non-device net plugin version 0
ik1-02:118802:118894 [6] NCCL INFO Using network IB
ik1-02:118801:118893 [5] NCCL INFO comm 0x5556640db830 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118800:118896 [4] NCCL INFO comm 0x559fc16e5a80 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118802:118894 [6] NCCL INFO comm 0x564d85ca57d0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId ba000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118799:118895 [3] NCCL INFO comm 0x5617814127f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118798:118891 [2] NCCL INFO comm 0x55d9f39299b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118796:118890 [0] NCCL INFO comm 0x55f80db6c060 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118803:118898 [7] NCCL INFO comm 0x5578a2409740 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118797:118897 [1] NCCL INFO comm 0x563beef52860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x2d7b59e9603b1325 - Init START
ik1-02:118801:118893 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffffff,ff000000,00000000
ik1-02:118801:118893 [5] NCCL INFO NVLS multicast support is available on dev 5
ik1-02:118796:118890 [0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff
ik1-02:118796:118890 [0] NCCL INFO NVLS multicast support is available on dev 0
ik1-02:118800:118896 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffffff,ff000000,00000000
ik1-02:118800:118896 [4] NCCL INFO NVLS multicast support is available on dev 4
ik1-02:118799:118895 [3] NCCL INFO Setting affinity for GPU 3 to ffffff,ffffffff
ik1-02:118799:118895 [3] NCCL INFO NVLS multicast support is available on dev 3
ik1-02:118802:118894 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffffff,ff000000,00000000
ik1-02:118802:118894 [6] NCCL INFO NVLS multicast support is available on dev 6
ik1-02:118798:118891 [2] NCCL INFO Setting affinity for GPU 2 to ffffff,ffffffff
ik1-02:118797:118897 [1] NCCL INFO Setting affinity for GPU 1 to ffffff,ffffffff
ik1-02:118803:118898 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffffff,ff000000,00000000
ik1-02:118803:118898 [7] NCCL INFO NVLS multicast support is available on dev 7
ik1-02:118797:118897 [1] NCCL INFO NVLS multicast support is available on dev 1
ik1-02:118798:118891 [2] NCCL INFO NVLS multicast support is available on dev 2
ik1-02:118796:118890 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
ik1-02:118797:118897 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
ik1-02:118796:118890 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
ik1-02:118797:118897 [1] NCCL INFO P2P Chunksize set to 524288
ik1-02:118796:118890 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
ik1-02:118803:118898 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
ik1-02:118798:118891 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
ik1-02:118802:118894 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
ik1-02:118796:118890 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
ik1-02:118801:118893 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
ik1-02:118803:118898 [7] NCCL INFO P2P Chunksize set to 524288
ik1-02:118800:118896 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
ik1-02:118798:118891 [2] NCCL INFO P2P Chunksize set to 524288
ik1-02:118802:118894 [6] NCCL INFO P2P Chunksize set to 524288
ik1-02:118796:118890 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
ik1-02:118801:118893 [5] NCCL INFO P2P Chunksize set to 524288
ik1-02:118800:118896 [4] NCCL INFO P2P Chunksize set to 524288
ik1-02:118796:118890 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
ik1-02:118799:118895 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
ik1-02:118796:118890 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
ik1-02:118799:118895 [3] NCCL INFO P2P Chunksize set to 524288
ik1-02:118796:118890 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
ik1-02:118796:118890 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
ik1-02:118796:118890 [0] NCCL INFO P2P Chunksize set to 524288
ik1-02:118801:118893 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Connected all rings
ik1-02:118799:118895 [3] NCCL INFO Connected all rings
ik1-02:118797:118897 [1] NCCL INFO Connected all rings
ik1-02:118796:118890 [0] NCCL INFO Connected all rings
ik1-02:118798:118891 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118798:118891 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118797:118897 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
ik1-02:118799:118895 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Connected all rings
ik1-02:118799:118895 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Connected all rings
ik1-02:118803:118898 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Connected all rings
ik1-02:118802:118894 [6] NCCL INFO Connected all rings
ik1-02:118799:118895 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118803:118898 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118802:118894 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118800:118896 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM
ik1-02:118801:118893 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM
ik1-02:118796:118890 [0] NCCL INFO Connected all trees
ik1-02:118796:118890 [0] NCCL INFO NVLS comm 0x55f80db6c060 headRank 0 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118797:118897 [1] NCCL INFO Connected all trees
ik1-02:118799:118895 [3] NCCL INFO Connected all trees
ik1-02:118798:118891 [2] NCCL INFO Connected all trees
ik1-02:118797:118897 [1] NCCL INFO NVLS comm 0x563beef52860 headRank 1 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118799:118895 [3] NCCL INFO NVLS comm 0x5617814127f0 headRank 3 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118798:118891 [2] NCCL INFO NVLS comm 0x55d9f39299b0 headRank 2 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118803:118898 [7] NCCL INFO Connected all trees
ik1-02:118800:118896 [4] NCCL INFO Connected all trees
ik1-02:118802:118894 [6] NCCL INFO Connected all trees
ik1-02:118801:118893 [5] NCCL INFO Connected all trees
ik1-02:118803:118898 [7] NCCL INFO NVLS comm 0x5578a2409740 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118801:118893 [5] NCCL INFO NVLS comm 0x5556640db830 headRank 5 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118800:118896 [4] NCCL INFO NVLS comm 0x559fc16e5a80 headRank 4 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118802:118894 [6] NCCL INFO NVLS comm 0x564d85ca57d0 headRank 6 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
ik1-02:118797:118897 [1] NCCL INFO Connected NVLS tree
ik1-02:118797:118897 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118797:118897 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118798:118891 [2] NCCL INFO Connected NVLS tree
ik1-02:118798:118891 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118798:118891 [2] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118799:118895 [3] NCCL INFO Connected NVLS tree
ik1-02:118799:118895 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118799:118895 [3] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118800:118896 [4] NCCL INFO Connected NVLS tree
ik1-02:118800:118896 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118800:118896 [4] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118801:118893 [5] NCCL INFO Connected NVLS tree
ik1-02:118801:118893 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118801:118893 [5] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118802:118894 [6] NCCL INFO Connected NVLS tree
ik1-02:118802:118894 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118802:118894 [6] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118796:118890 [0] NCCL INFO Connected NVLS tree
ik1-02:118803:118898 [7] NCCL INFO Connected NVLS tree
ik1-02:118803:118898 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118803:118898 [7] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118796:118890 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
ik1-02:118796:118890 [0] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
ik1-02:118800:118896 [4] NCCL INFO comm 0x559fc16e5a80 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118796:118890 [0] NCCL INFO comm 0x55f80db6c060 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118802:118894 [6] NCCL INFO comm 0x564d85ca57d0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId ba000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118803:118898 [7] NCCL INFO comm 0x5578a2409740 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118801:118893 [5] NCCL INFO comm 0x5556640db830 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118798:118891 [2] NCCL INFO comm 0x55d9f39299b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118799:118895 [3] NCCL INFO comm 0x5617814127f0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x2d7b59e9603b1325 - Init COMPLETE
ik1-02:118797:118897 [1] NCCL INFO comm 0x563beef52860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x2d7b59e9603b1325 - Init COMPLETE

ik1-02:118800:118800 [4] enqueue.cc:1085 NCCL WARN Cuda failure 'misaligned address'
ik1-02:118800:118800 [4] NCCL INFO group.cc:162 -> 1
ik1-02:118800:118800 [4] NCCL INFO group.cc:339 -> 1
ik1-02:118800:118800 [4] NCCL INFO group.cc:418 -> 1
ik1-02:118800:118800 [4] NCCL INFO group.cc:95 -> 1
Traceback (most recent call last):
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/main.py", line 14, in <module>
    main()
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/main.py", line 9, in main
    all_reduce(test_tensor)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 49, in _get_msg_dict
    "args": f"{args}, {kwargs}",
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/_tensor.py", line 461, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 677, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 597, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f54434ced87 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f544347f75f in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f544359f8a8 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d686 (0x7f544356a686 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f744 (0x7f544356c744 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1fb6d (0x7f544356cb6d in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x540260 (0x7f5441ecb260 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x649bf (0x7f54434b39bf in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f54434acc8b in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f54434ace39 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x802be8 (0x7f544218dbe8 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f544218df66 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x150660 (0x559fb86d6660 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #13: <unknown function> + 0x164518 (0x559fb86ea518 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #14: <unknown function> + 0x164545 (0x559fb86ea545 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #15: <unknown function> + 0x12815f (0x559fb86ae15f in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #16: PyDict_SetItemString + 0xa3 (0x559fb86b21f3 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #17: <unknown function> + 0x263ce7 (0x559fb87e9ce7 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #18: Py_FinalizeEx + 0x148 (0x559fb87e64c8 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #19: Py_RunMain + 0x173 (0x559fb87d7913 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #20: Py_BytesMain + 0x2d (0x559fb87ae02d in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)
frame #21: <unknown function> + 0x29d90 (0x7f5444242d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: __libc_start_main + 0x80 (0x7f5444242e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x25 (0x559fb87adf25 in /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3)

ik1-02:118801:118947 [5] NCCL INFO [Service thread] Connection closed by localRank 4
ik1-02:118799:118953 [3] NCCL INFO [Service thread] Connection closed by localRank 4
ik1-02:118796:118948 [0] NCCL INFO [Service thread] Connection closed by localRank 4
[2024-02-07 11:09:23,719] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118796 closing signal SIGTERM
[2024-02-07 11:09:23,719] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118797 closing signal SIGTERM
[2024-02-07 11:09:23,720] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118798 closing signal SIGTERM
[2024-02-07 11:09:23,720] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118799 closing signal SIGTERM
[2024-02-07 11:09:23,721] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118801 closing signal SIGTERM
[2024-02-07 11:09:23,721] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118802 closing signal SIGTERM
[2024-02-07 11:09:23,721] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 118803 closing signal SIGTERM
[2024-02-07 11:09:25,801] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 4 (pid: 118800) of binary: /mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/python3
Traceback (most recent call last):
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/nvme/vfm/alexis/sakura-bug-repro/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
main.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-07_11:09:23
  host      : ik1-02
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 118800)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 118800
=======================================================
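
With 8 ranks writing to the same log, the one line that matters (the `NCCL WARN`) is easy to miss among the `NCCL INFO` channel setup lines. Below is a minimal sketch for pulling it out of a saved log. It assumes the line prefix seen in this log (`host:pid:tid [device]`, where the bracketed number is the CUDA device index); `find_nccl_warnings` and `sample` are hypothetical names used only for illustration, not part of NCCL or PyTorch.

```python
import re

# Matches NCCL warning lines of the form seen in the log above, e.g.
# "ik1-02:118800:118800 [4] enqueue.cc:1085 NCCL WARN Cuda failure 'misaligned address'"
NCCL_WARN = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def find_nccl_warnings(log_text):
    """Return one dict per NCCL WARN line found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = NCCL_WARN.match(line.strip())
        if match is None:
            continue  # INFO lines and Python tracebacks are skipped
        found.append(
            {
                "host": match["host"],
                "pid": int(match["pid"]),
                "device": int(match["dev"]),
                "source": match["src"],
                "message": match["msg"],
            }
        )
    return found


# Three lines copied from the log above: two INFO lines and the one WARN line.
sample = """\
ik1-02:118800:118896 [4] NCCL INFO Connected NVLS tree
ik1-02:118800:118800 [4] enqueue.cc:1085 NCCL WARN Cuda failure 'misaligned address'
ik1-02:118800:118800 [4] NCCL INFO group.cc:162 -> 1
"""

warnings_found = find_nccl_warnings(sample)
for w in warnings_found:
    print(f"{w['host']} pid={w['pid']} device={w['device']} {w['source']}: {w['message']}")
```

On the log above this singles out PID 118800 on device 4, which matches the rank that torchrun reports as the first observed failure.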

Versions

$ python3 collect_env.py
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8480+
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 56
Socket(s): 2
Stepping: 8
Frequency boost: enabled
CPU max MHz: 2001.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 5.3 MiB (112 instances)
L1i cache: 3.5 MiB (112 instances)
L2 cache: 224 MiB (112 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-55
NUMA node1 CPU(s): 56-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] Could not collect

cc @ptrblck

lijing1996 commented 3 months ago

Same problem with CUDA 12.2.

alexisVallet commented 3 months ago

In case this helps other people who run into this issue: I hit it when training with DistributedDataParallel using bfloat16 weights. The best workaround I have found is to convert the gradients to float32 before all_reduce, using code like this:

from typing import Optional

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Based on code for:
# https://pytorch.org/docs/stable/ddp_comm_hooks.html#torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook
def convert_grad_float32(
    process_group: Optional[dist.ProcessGroup], bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    group_to_use = (
        process_group if process_group is not None else dist.group.WORLD
    )
    world_size = group_to_use.size()
    tensor = bucket.buffer()
    orig_dtype = tensor.dtype
    # Upcast bfloat16 gradients so that all_reduce runs on a float32 copy.
    if orig_dtype == torch.bfloat16:
        tensor = tensor.to(torch.float32)
    # Pre-divide so that the summed result is the average over ranks.
    tensor.div_(world_size)

    fut = dist.all_reduce(
        tensor, group=group_to_use, async_op=True
    ).get_future()

    def decode(fut):
        # Copy the reduced values back into the bucket; copy_ casts them
        # to the bucket's original dtype.
        out_tensor = bucket.buffer()
        out_tensor.copy_(fut.value()[0])
        return out_tensor

    return fut.then(decode)

# `model` is assumed to be an existing torch.nn.Module on this rank's GPU.
model_dist = DistributedDataParallel(model)
model_dist.register_comm_hook(None, convert_grad_float32)

Yangruipis commented 2 weeks ago

Same issue here with cuda==12.2 and torch==2.2.2. It only occurs when training with DDP on 4, 8, or more GPUs.

So I guess there is some alignment issue for bf16 with specific input shapes, perhaps like this issue: https://github.com/Dao-AILab/flash-attention/issues/289. DDP training with bf16 is a very common use case, yet this issue is not reported very often.
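
To make the alignment guess concrete for the tensor size in the original repro (23528522 bfloat16 elements on 8 ranks), here is the arithmetic as a small sketch. The 16-byte boundary is an assumption chosen for illustration (a common width for vectorized GPU loads), not something the logs confirm, and `alignment_report` is a hypothetical helper. It only shows how the sizes divide and does not establish the cause of the error.

```python
BF16_BYTES = 2  # bfloat16 is a 16-bit type
FP32_BYTES = 4  # float32, the dtype used by the workaround above


def alignment_report(numel, itemsize, world_size, align=16):
    """Summarize how a flat buffer divides by an alignment and by the rank count.

    `align` is an assumed boundary in bytes, used for illustration only.
    """
    nbytes = numel * itemsize
    return {
        "nbytes": nbytes,
        "bytes_mod_align": nbytes % align,
        "numel_mod_world_size": numel % world_size,
    }


bf16_report = alignment_report(23528522, BF16_BYTES, world_size=8)
fp32_report = alignment_report(23528522, FP32_BYTES, world_size=8)
print("bfloat16:", bf16_report)
print("float32: ", fp32_report)
```

The bfloat16 buffer is 47057044 bytes, which is not a multiple of 16, and the element count does not divide evenly across 8 ranks. The float32 buffer is not a multiple of 16 bytes either, so total size alone would not explain why the float32 workaround avoids the error.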

SCZwangxiao commented 1 week ago

Same problem. It probably has something to do with the hardware, since I ran into it when switching from A100 to H100 with the code unchanged.