microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.25k stars 4.08k forks

[BUG] "Bus error: nonexistent physical address" when training on 8 GPUs using NCCL #2638

Closed: BioGeek closed this issue 1 year ago

BioGeek commented 1 year ago

Describe the bug

A distributed training run on 8 GPUs using NCCL failed after 11 hours with:

[2022-12-21 10:42:45,191] [INFO] [logging.py:68:log_dist] [Rank 0] step=80000, skipped=0, lr=[0.0003], mom=[[0.9, 0.999]]
[2022-12-21 10:42:45,192] [INFO] [timer.py:198:stop] 0/160000, RunningAvgSamplesPerSec=640.7417820778238, CurrSamplesPerSec=481.0283328405415, MemAllocated=0.19GB, MaxMemAllocated=11.27GB
[experiment-3fce9b69-1446-worker-0:2121 :0:7025] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 7025) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x0000000000068d7c ncclGroupEnd() ???:0
3 0x000000000005de3d ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
Epoch 10/5961 : || 53.69% [9007/16777 47:52<41:17 steps : 159,980, loss : 1.086, sec/batch : 0.087]
[2022-12-21 10:43:17,573] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2118
[2022-12-21 10:43:18,777] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2119
[2022-12-21 10:43:19,767] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2120
[2022-12-21 10:43:20,943] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2121
[2022-12-21 10:43:20,943] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2122
[2022-12-21 10:43:21,943] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2123
[2022-12-21 10:43:22,983] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2124
[2022-12-21 10:43:23,943] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2125
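
When triaging a log like the one above, it can help to pull out which process received the signal and confirm that it is one of the launcher's subprocesses. The following is a minimal sketch, assuming the crash line and the `Killing subprocess` lines keep exactly the shape shown in this report. The names `parse_crash_line` and `killed_pids` are made up for illustration and are not part of DeepSpeed or NCCL.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines shaped like the one in this report:
#   [<host>:<pid> :<n>:<tid>] Caught signal <num> (<description>)
# The layout is inferred from this single log, so treat it as an assumption.
CRASH_RE = re.compile(
    r"\[(?P<host>[^\s:\]]+):(?P<pid>\d+)\s*:\d+:(?P<tid>\d+)\]\s+"
    r"Caught signal (?P<signum>\d+) \((?P<desc>.*)\)\s*$"
)


class Crash(NamedTuple):
    host: str
    pid: int
    tid: int
    signum: int
    description: str


def parse_crash_line(line: str) -> Optional[Crash]:
    """Return the crash details if the line matches, otherwise None."""
    m = CRASH_RE.search(line)
    if m is None:
        return None
    return Crash(m["host"], int(m["pid"]), int(m["tid"]),
                 int(m["signum"]), m["desc"])


def killed_pids(log_text: str) -> List[int]:
    """Collect the PIDs from the launcher's 'Killing subprocess N' lines."""
    return [int(p) for p in re.findall(r"Killing subprocess (\d+)", log_text)]
```

Applied to the log above, the crash line yields PID 2121, which also appears among the subprocesses the launcher kills afterwards.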

To Reproduce

  1. I am using the deepspeed/deepspeed:v072_torch112_cu117 Docker image.
  2. Distributed training is initialized with deepspeed.init_distributed(dist_backend="nccl").
  3. I have now set the NCCL_DEBUG=INFO environment variable and restarted the training. I will report back if the problem appears again and I have more debugging information.
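
For reference, NCCL reads its debug settings from the environment, so they need to be in place before the NCCL backend is initialized. The following is a minimal sketch of setting them from Python rather than from the shell. The helper name `enable_nccl_debug` and the example log path are hypothetical; `NCCL_DEBUG` and `NCCL_DEBUG_FILE` are the documented NCCL variables.

```python
import os
from typing import MutableMapping, Optional


def enable_nccl_debug(
    env: MutableMapping[str, str] = os.environ,
    level: str = "INFO",
    log_file: Optional[str] = None,
) -> None:
    """Set NCCL debug variables in `env` (defaults to this process's environment)."""
    env["NCCL_DEBUG"] = level
    if log_file is not None:
        # NCCL expands %h to the hostname and %p to the process ID,
        # which keeps the output of each rank in its own file.
        env["NCCL_DEBUG_FILE"] = log_file


# Intended call order in a training script (not executed here):
#   enable_nccl_debug(log_file="/runs/nccl.%h.%p.log")  # hypothetical path
#   import deepspeed
#   deepspeed.init_distributed(dist_backend="nccl")
```

Writing to a per-process file is optional, but it avoids the interleaving of rank output with the progress bar that shows up when everything goes to stdout.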

Expected behavior

Training continues.

ds_report output

#22 3.568 DeepSpeed C++/CUDA extension op report
#22 3.568 --------------------------------------------------
#22 3.568 NOTE: Ops not installed will be just-in-time (JIT) compiled at
#22 3.568 runtime if needed. Op compatibility means that your system
#22 3.568 meet the required dependencies to JIT install the op.
#22 3.568 --------------------------------------------------
#22 3.568 JIT compiled ops requires ninja
#22 3.568 ninja .................. [OKAY]
#22 3.568 --------------------------------------------------
#22 3.568 op name ................ installed .. compatible
#22 3.568 --------------------------------------------------
#22 3.568 cpu_adam ............... [NO] ....... [OKAY]
#22 3.568 cpu_adagrad ............ [NO] ....... [OKAY]
#22 3.568 fused_adam ............. [NO] ....... [OKAY]
#22 3.568 fused_lamb ............. [NO] ....... [OKAY]
#22 3.568 sparse_attn ............ [NO] ....... [OKAY]
#22 3.568 transformer ............ [NO] ....... [OKAY]
#22 3.568 stochastic_transformer . [NO] ....... [OKAY]
#22 3.568 async_io ............... [NO] ....... [OKAY]
#22 3.568 utils .................. [NO] ....... [OKAY]
#22 3.568 quantizer .............. [NO] ....... [OKAY]
#22 3.568 transformer_inference .. [NO] ....... [OKAY]
#22 3.568 --------------------------------------------------
#22 3.568 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#22 3.568 DeepSpeed general environment info:
#22 3.568 torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
#22 3.568 torch version .................... 1.12.0a0+8a1a93a
#22 3.568 torch cuda version ............... 11.7
#22 3.568 torch hip version ................ None
#22 3.568 nvcc version ..................... 11.7
#22 3.568 deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
#22 3.568 deepspeed info ................... 0.7.2, unknown, unknown
#22 3.568 deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7

Launcher context

Launching with the following command:

    deepspeed \
        --num_nodes=1 \
        --num_gpus=8 \
        ./src/train.py \
        checkpoint_path=/runs \
        train_data_path=./data/dataset_v1/ \
        batch_size=12 \
        distributed.n_gpus_per_node=8 \
        --deepspeed \
        --deepspeed_config=deepspeed_cfg.json

Docker context

deepspeed/deepspeed:v072_torch112_cu117
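
The report so far does not identify a cause. One thing that is often worth ruling out when NCCL runs inside a container is a small shared-memory mount, since NCCL's troubleshooting guide recommends raising Docker's default shared-memory limit. The sketch below only reports the size of `/dev/shm` from inside the container. It is an assumption that this is relevant to the crash above, and the function names and the 1 GiB threshold are illustrative choices rather than values taken from this issue.

```python
import shutil

GIB = 1024 ** 3


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total and free space of the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_gib": usage.total / GIB,
        "free_gib": usage.free / GIB,
    }


def shm_looks_small(path: str = "/dev/shm", min_total_gib: float = 1.0) -> bool:
    """True if the mount is smaller than `min_total_gib` (an arbitrary threshold)."""
    return shm_report(path)["total_gib"] < min_total_gib


if __name__ == "__main__":
    # Run inside the training container, before launching the job.
    print(shm_report())
```

If the mount turns out to be small, the size is controlled by how the container is started (for example Docker's `--shm-size` option), not by anything in the training script.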

tjruwase commented 1 year ago

@BioGeek, thanks for reporting the error. Looking forward to updates on your run with NCCL_DEBUG=INFO.

BioGeek commented 1 year ago
Here is the output with NCCL_DEBUG=INFO:

``` Epoch 1/5961 : |--| 11.86% [1990/16777 14:32<1:48:03 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.87% [1991/16777 14:32<1:48:01 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.87% [1992/16777 14:33<1:47:59 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.88% [1993/16777 14:33<1:47:57 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.89% [1994/16777 14:33<1:47:55 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.89% [1995/16777 14:33<1:47:53 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.90% [1996/16777 14:33<1:47:51 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.90% [1997/16777 14:34<1:47:48 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.91% [1998/16777 14:34<1:47:47 steps : 1,000, loss : 2.727, sec/batch : 0.145] Epoch 1/5961 : |--| 11.92% [1999/16777 14:34<1:47:45 steps : 1,000, loss : 2.727, sec/batch : 0.145] [2022-12-22 01:07:48,305] [INFO] [logging.py:68:log_dist] [Rank 0] step=1000, skipped=0, lr=[2.997e-05], mom=[[0.9, 0.999]] [2022-12-22 01:07:48,306] [INFO] [timer.py:198:stop] 0/2000, RunningAvgSamplesPerSec=676.8073846313383, CurrSamplesPerSec=589.0159006704547, MemAllocated=0.19GB, MaxMemAllocated=11.21GB INFO:root:[RANK 4] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 
0.1\n"} INFO:root:[RANK 5] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 3] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 2] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 
100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 1] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 7] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 0] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: 
./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} INFO:root:[RANK 6] Saving checkpoint locally to /runs/ckpt_00002000 with client_sd: {'steps': 2000, 'last_epoch': 0, 'cfg_yaml': "distributed:\n dist_backend: nccl\n dist_url: tcp://localhost:54321\n n_gpus_per_node: 8\nmodel_cfg:\n vocab_size: 24\n max_len: 20\n dim: 256\n nheads: 16\n layers: 6\n input_size: 4\n dropout: 0.1\ndevice: cuda\nseed: 1775\nbatch_size: 12\nnum_workers: 16\nsummary_interval: 25\ncheckpoint_interval: 2000\nstdout_interval: 100\nvalidation_interval: 1000\nmax_steps: 100000000\ncheckpoint_path: /runs\ntrain_data_path: ./data/denovo_dataset_v1/\nresume_checkpoint: ''\ntest_split_seed: 100\ntest_split: 0.1\nvalid_split: 0.1\nvalid_proportion: 0.1\n"} Epoch 1/5961 : |--| 11.92% [2000/16777 14:34<1:47:43 steps : 1,000, loss : 2.727, sec/batch : 0.145] experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Setting affinity for GPU 1 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Setting affinity for GPU 2 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Setting affinity for GPU 0 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,0000ffff,ff000000 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 00/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Trees [0] 
-1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 01/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 02/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 03/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 04/12 : 0 1 2 3 4 5 6 7 
experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 05/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 06/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 07/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 08/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 09/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 10/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 11/12 : 0 1 2 3 4 5 6 7 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 00 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 00 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 00 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 00 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 00 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 00 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 00 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 
01 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 00 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 01 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 01 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 01 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 01 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 01 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 01 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 02 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 01 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 02 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 02 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 02 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 02 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 02 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 02 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 03 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 02 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 03 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 03 : 5[e2000] -> 
6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 03 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 03 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 03 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 03 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 04 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 03 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 04 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 04 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 04 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 04 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 04 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 04 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 05 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 04 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 05 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 05 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 05 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 05 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 05 : 7[e7000] -> 0[b7000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 05 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 06 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 05 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 06 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 06 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 06 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 06 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 06 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 06 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 07 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 06 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 07 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 07 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 07 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 07 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 07 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 07 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 08 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 07 : 0[b7000] -> 1[b9000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 08 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 08 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 08 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 08 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 08 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 08 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 09 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 08 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 09 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 09 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 09 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 09 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 09 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 09 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 10 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 09 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 10 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 10 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 10 : 4[e0000] -> 5[e2000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 10 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 10 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 10 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 11 : 1[b9000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 10 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 11 : 5[e2000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 11 : 4[e0000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 11 : 6[e5000] -> 7[e7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 11 : 3[be000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 11 : 7[e7000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 11 : 2[bc000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Channel 11 : 0[b7000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 00 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Connected all rings experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Connected all rings 
experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 01 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 02 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 03 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 04 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 05 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 06 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 07 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 08 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 09 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 10 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Channel 11 : 7[e7000] -> 6[e5000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 00 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 00 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 00 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 00 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 00 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 00 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 01 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 01 : 6[e5000] -> 5[e2000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 01 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 01 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 01 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 01 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 02 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 02 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 02 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 02 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 02 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 02 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 03 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 03 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 03 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 03 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 03 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 03 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 04 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 04 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 04 : 6[e5000] -> 5[e2000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 04 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 04 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 04 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 05 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 05 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 05 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 05 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 05 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 05 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 06 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 06 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 06 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 06 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 06 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 06 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 07 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 07 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 07 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 07 : 2[bc000] -> 1[b9000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 07 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 07 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 08 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 08 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 08 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 08 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 08 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 08 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 09 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 09 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 09 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 09 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 09 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 09 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 10 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 10 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 10 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 10 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 10 : 6[e5000] -> 5[e2000] via P2P/IPC 
experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 10 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Channel 11 : 5[e2000] -> 4[e0000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Channel 11 : 3[be000] -> 2[bc000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Channel 11 : 2[bc000] -> 1[b9000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Channel 11 : 4[e0000] -> 3[be000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Channel 11 : 6[e5000] -> 5[e2000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Channel 11 : 1[b9000] -> 0[b7000] via P2P/IPC experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512 experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512 experiment-1eab9c37-70ab-worker-0:2125:6172 [7] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer experiment-1eab9c37-70ab-worker-0:2117:6173 [0] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512 experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512 experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512 experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO Connected all trees experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO 
threadThresholds 8/8/64 | 64/8/64 | 8/8/512
experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO Connected all trees
experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO Connected all trees
experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
experiment-1eab9c37-70ab-worker-0:2122:6169 [5] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
experiment-1eab9c37-70ab-worker-0:2123:6174 [6] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
experiment-1eab9c37-70ab-worker-0:2120:6167 [3] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
experiment-1eab9c37-70ab-worker-0:2118:6171 [1] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
experiment-1eab9c37-70ab-worker-0:2121:6168 [4] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
experiment-1eab9c37-70ab-worker-0:2119:6170 [2] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer
[experiment-1eab9c37-70ab-worker-0:2119 :0:6178] Caught signal 7 (Bus error: nonexistent physical address)
[experiment-1eab9c37-70ab-worker-0:2121 :0:6175] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 6178) ====
0 0x0000000000014420 __funlockfile() ???:0
==== backtrace (tid: 6175) ====
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x0000000000068d7c ncclGroupEnd() ???:0
3 0x000000000005de3d ncclGroupEnd() ???:0
0 0x0000000000014420 __funlockfile() ???:0
4 0x0000000000008609 start_thread() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
5 0x000000000011f133 clone() ???:0
=================================
2 0x0000000000068d7c ncclGroupEnd() ???:0
3 0x000000000005de3d ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[2022-12-22 01:07:50,325] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2117
[2022-12-22 01:07:51,328] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2118
[2022-12-22 01:07:52,476] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2119
[2022-12-22 01:07:52,477] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2120
[2022-12-22 01:07:53,517] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2121
[2022-12-22 01:07:53,517] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2122
[2022-12-22 01:07:55,158] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2123
[2022-12-22 01:07:56,198] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2125
[2022-12-22 01:07:57,038] [ERROR] [launch.py:292:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', './dtu_denovo_sequencing/train.py', '--local_rank=7', 'train_data_path=./data/denovo_dataset_v1/', 'batch_size=12', 'distributed.n_gpus_per_node=8', '--deepspeed', '--deepspeed_config=deepspeed_cfg.json'] exits with return code = -7
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.8/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
```

I also added some extra logging, which showed that the error occurred during checkpointing. I am running the code in a Docker container on a Kubernetes cluster; I first save the checkpoints locally inside the container and then upload them to an S3 bucket (*). There was a bug in that code. After fixing it, I can train again without any crashes. Still, it would be handy to see a Python stack trace of what went wrong instead of this opaque NCCL output.

(*) As far as I can see, logging checkpoints directly to an S3 bucket is not supported by DeepSpeed, but I am happy to be corrected if that assumption is wrong.
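
On the wish for a Python stack trace: the standard-library `faulthandler` module can dump the Python tracebacks of all threads when the process receives a fatal signal such as SIGBUS. The sketch below is a minimal illustration, not something taken from the training script in this issue; the helper name `enable_crash_tracebacks` is hypothetical, and whether the dump actually appears may depend on other libraries installing their own signal handlers afterwards.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None) -> bool:
    """Ask CPython to dump Python tracebacks of all threads on fatal signals.

    `stream` must be a real file with a file descriptor; defaults to stderr.
    """
    if stream is None:
        stream = sys.stderr
    if not faulthandler.is_enabled():
        # faulthandler covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Call this once, as early as possible in the training entry point.
    enable_crash_tracebacks()
```

The same behaviour can be switched on without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable or by running Python with `-X faulthandler`.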

felifri commented 1 year ago

How did you find the problem in the end, as I'm stuck with the same issue?

tjruwase commented 1 year ago

@BioGeek, thanks for sharing your log and for confirming that you fixed the issue.

> (*) As far as I can see, logging checkpoints directly to an S3 bucket is not supported by DeepSpeed, but I am happy to be corrected if that assumption is wrong.

Can you clarify what you mean by logging checkpoints directly to an S3 bucket? Do you mean specifying an S3 bucket as the destination folder for engine.save_checkpoint()? Either way, the answer to your question is probably no, as I don't believe we have ever tested such support.

BioGeek commented 1 year ago

> How did you find the problem in the end, as I'm stuck with the same issue?

@felifri I added extensive logging so that I could see up to which point the code ran before it failed.
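
A minimal sketch of that kind of breadcrumb logging, assuming nothing about the actual training script: each step is wrapped so that an ENTER line is written before it runs and an EXIT line only if it finishes. After a crash, the last ENTER without a matching EXIT points at the step that failed. The names `make_rank_logger` and `breadcrumb` are hypothetical helpers.

```python
import logging
import sys
from contextlib import contextmanager


def make_rank_logger(rank: int, stream=None) -> logging.Logger:
    """Create a logger that prefixes every message with the process rank."""
    logger = logging.getLogger(f"rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # StreamHandler flushes after every record, so lines survive a hard crash.
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter(f"[RANK {rank}] %(message)s"))
    logger.handlers = [handler]
    return logger


@contextmanager
def breadcrumb(logger: logging.Logger, label: str):
    """Log ENTER before the wrapped step and EXIT only if it completes."""
    logger.info("ENTER %s", label)
    yield
    # Not reached if the step raises or the process is killed by a signal.
    logger.info("EXIT %s", label)
```

Usage would look like `with breadcrumb(log, "save_checkpoint"): ...` around each stage of the checkpointing code.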

> Can you clarify what you mean by logging checkpoints directly to an S3 bucket? Do you mean specifying an S3 bucket as the destination folder for engine.save_checkpoint()?

@tjruwase Yes, indeed. I would like to be able to do something like engine.save_checkpoint(save_dir='s3://my-bucket/checkpoints/'), but that doesn't work because save_checkpoint calls, for example, os.makedirs(save_dir, exist_ok=True), which doesn't make sense for an S3 URI.
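
To illustrate why a local-filesystem call cannot be applied to such a URI, the sketch below separates an S3 URI into its bucket and key prefix using only the standard library. The helpers `is_s3_uri` and `split_s3_uri` are hypothetical names and are not part of DeepSpeed; code that wanted to support remote destinations would presumably need a check of this kind before touching `os.makedirs`.

```python
from typing import Tuple
from urllib.parse import urlparse


def is_s3_uri(path: str) -> bool:
    """Return True if `path` looks like s3://bucket/... rather than a local path."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/some/prefix/ into ("bucket", "some/prefix/")."""
    if not is_s3_uri(uri):
        raise ValueError(f"not an S3 URI: {uri!r}")
    parsed = urlparse(uri)
    # The bucket is the network location; the key prefix is the path
    # without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

For the URI in the comment above, `split_s3_uri('s3://my-bucket/checkpoints/')` yields the bucket `my-bucket` and the prefix `checkpoints/`, neither of which is a directory that `os.makedirs` could create on the remote store.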

The workaround I have now looks something like this (it can probably be improved):

            client_state = {}

            if steps % cfg.checkpoint_interval == 0 and steps != 0:
                checkpoint_tag = f"ckpt_{steps:08d}"

                client_state["steps"] = steps
                client_state["last_epoch"] = epoch
                client_state["cfg_yaml"] = OmegaConf.to_yaml(cfg)

                local_dir = os.path.join(cfg.checkpoint_path, checkpoint_tag)
                if not os.path.exists(local_dir):
                    logging.info(f"[RANK {rank}] Creating {local_dir}")
                    os.makedirs(local_dir, exist_ok=True)

                # First save the checkpoint locally; this must be done on all ranks.
                # save_checkpoint() blocks until all ranks have synchronised,
                # and calls .barrier() at the end to ensure all ranks are done writing.
                logging.info(f"[RANK {rank}] Saving checkpoint locally to {local_dir} with client_state: {client_state}")
                model_engine.save_checkpoint(
                    save_dir=cfg.checkpoint_path,
                    tag=checkpoint_tag,
                    client_state=client_state,
                )
                logging.info(f"[RANK {rank}] Saved checkpoint locally to {local_dir}")

                if rank == 0:
                    # now upload to S3
                    logging.info(f"[RANK {rank}] Creating s3fs.core.S3FileSystem")
                    s3 = s3fs.core.S3FileSystem(
                        client_kwargs={"endpoint_url": os.environ.get("S3_ENDPOINT")}
                    )
                    logging.info(f"[RANK {rank}] Created s3fs.core.S3FileSystem: {s3}")

                    # Prepare for checkpoint load by ensuring all parameters are partitioned
                    # https://github.com/microsoft/DeepSpeed/blob/6273dffc2f192275a08268b683c309a328b52191/deepspeed/runtime/engine.py#L2752
                    if model_engine.zero_optimization_partition_weights():
                        model_engine.optimizer.checkpoint_event_prologue()

                    # https://github.com/microsoft/DeepSpeed/blob/6273dffc2f192275a08268b683c309a328b52191/deepspeed/runtime/engine.py#L2789
                    ckpt_list = model_engine._get_all_ckpt_names(
                        cfg.checkpoint_path, checkpoint_tag
                    )
                    logging.info(f"ckpt_list: {ckpt_list}")
                    for local_chkpt_path in ckpt_list:
                        relative_path = Path(local_chkpt_path).relative_to(cfg.checkpoint_path)
                        s3_chkpt_path = f"{os.environ['S3_BUCKET']}{relative_path}"
                        logging.info(f"s3_chkpt_path: {s3_chkpt_path}")

                        with open(local_chkpt_path, "rb") as local_fp, s3.open(
                            s3_chkpt_path, "wb"
                        ) as remote_fp:
                            remote_fp.write(local_fp.read())
                            logging.info(f"Wrote {local_chkpt_path} to {s3_chkpt_path}")
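
The upload loop above reads each checkpoint file fully into memory before writing it out. A possible variation, sketched here under assumptions, streams the file in chunks and builds the remote key with an explicit separator. `upload_files`, `remote_key` and the `open_remote` callable are hypothetical; with s3fs, `open_remote` could be something like `lambda key: s3.open(key, "wb")`, but that pairing has not been tested against this training setup.

```python
import posixpath
import shutil
from pathlib import Path
from typing import BinaryIO, Callable, Iterable, List


def remote_key(local_file: str, local_root: str, remote_prefix: str) -> str:
    """Map a local checkpoint file to a key under `remote_prefix`."""
    relative = Path(local_file).relative_to(local_root)
    # posixpath.join inserts exactly one "/" whether or not the prefix ends in one.
    return posixpath.join(remote_prefix, relative.as_posix())


def upload_files(
    local_files: Iterable[str],
    local_root: str,
    remote_prefix: str,
    open_remote: Callable[[str], BinaryIO],
    chunk_size: int = 16 * 1024 * 1024,
) -> List[str]:
    """Copy each file to the remote store in chunks; return the written keys."""
    uploaded = []
    for local_file in local_files:
        key = remote_key(local_file, local_root, remote_prefix)
        with open(local_file, "rb") as src, open_remote(key) as dst:
            # Streams chunk_size bytes at a time instead of loading the whole file.
            shutil.copyfileobj(src, dst, chunk_size)
        uploaded.append(key)
    return uploaded
```

Keeping the remote opener as a parameter means the same function can be exercised against a local directory in a test before pointing it at a real bucket.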

tjruwase commented 1 year ago

@BioGeek, thanks for the clarification. DeepSpeed uses torch.save() internally to save checkpoints. Do you know if torch.save() can write directly to S3?

BioGeek commented 1 year ago

Not directly. torch.save() expects a file-like object (has to implement write and flush) or a string or os.PathLike object containing a file name.

There are workarounds like:

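# Assumes `s3` is a boto3 S3 client, e.g. boto3.client("s3")
buffer = io.BytesIO()
torch.save(model, buffer)
# Bucket must be the bare bucket name; the "checkpoints/" prefix belongs in the key
s3.put_object(Bucket="my-bucket", Key=f"checkpoints/{output_model_file}", Body=buffer.getvalue())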

but I don't think that would help here.
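
For illustration only, the file-like requirement mentioned above (an object that implements write and flush) can be met by a small adapter that forwards data in fixed-size parts instead of holding the whole checkpoint in one buffer. The class `ChunkedWriter` and its `send_part` callback are hypothetical; in the usage below `pickle.dump` stands in for `torch.save()` so that the snippet runs without PyTorch, and the adapter has not been tested with `torch.save()` itself. Note that this sketch treats `flush()` as the final call.

```python
import pickle
from typing import Callable


class ChunkedWriter:
    """Minimal file-like object that hands fixed-size parts to a callback."""

    def __init__(self, send_part: Callable[[bytes], None], part_size: int = 1024) -> None:
        self._send_part = send_part
        self._part_size = part_size
        self._buffer = bytearray()

    def write(self, data) -> int:
        self._buffer.extend(data)
        # Emit every complete part; keep the remainder buffered.
        while len(self._buffer) >= self._part_size:
            self._send_part(bytes(self._buffer[: self._part_size]))
            del self._buffer[: self._part_size]
        return len(data)

    def flush(self) -> None:
        # Emit whatever is left as a final, possibly shorter, part.
        if self._buffer:
            self._send_part(bytes(self._buffer))
            self._buffer.clear()


if __name__ == "__main__":
    parts = []
    writer = ChunkedWriter(parts.append, part_size=16)
    pickle.dump({"steps": 80000}, writer)  # stand-in for torch.save(obj, writer)
    writer.flush()
    print(f"serialized into {len(parts)} part(s)")
```

In a real setup the callback would hand each part to the storage client; how that maps onto a particular S3 upload API is left open here.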

felifri commented 1 year ago

This is what I receive when I run NCCL_DEBUG=INFO deepspeed --num_gpus 8 main_deepspeed.py --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8

console output ``` c664db21f8dd:90416:93283 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 01/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 03/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 03/0 : 
6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 04/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 04/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 04/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 04/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 04/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 05/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 05/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 05/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 05/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 05/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 05/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 06/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 06/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 06/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO 
Channel 06/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 06/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 07/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 07/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 07/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 07/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 07/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 08/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 08/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 08/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 08/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 08/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 09/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 09/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 
[4] NCCL INFO Channel 09/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 09/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 09/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 10/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 10/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 10/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 10/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 10/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 11/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 11/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 11/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 11/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 11/0 : 1[f000] -> 2[47000] via P2P/IPC/read 
c664db21f8dd:90420:93277 [5] NCCL INFO Channel 12/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 12/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 12/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 12/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 12/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 13/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 13/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 13/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 13/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 13/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 14/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 14/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 14/0 : 2[47000] -> 3[4e000] 
via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 14/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 14/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 15/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 15/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 15/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 15/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 15/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 16/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 16/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 16/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 16/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 16/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 17/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 17/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 17/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 17/0 : 
7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 17/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 17/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 17/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 18/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 18/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 18/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 18/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 18/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 19/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 19/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 19/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 19/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 19/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 20/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO 
Channel 20/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 20/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 20/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 20/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 21/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 21/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 21/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 21/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 21/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 22/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 22/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 22/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 22/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 
[0] NCCL INFO Channel 22/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 23/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 23/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 23/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 23/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Channel 23/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Connected all rings c664db21f8dd:90417:93281 [3] NCCL INFO Connected all rings c664db21f8dd:90420:93277 [5] NCCL INFO Connected all rings c664db21f8dd:90422:93275 [6] NCCL INFO Connected all rings c664db21f8dd:90416:93283 [2] NCCL INFO Connected all rings c664db21f8dd:90414:93273 [0] NCCL INFO Connected all rings c664db21f8dd:90415:93287 [1] NCCL INFO Connected all rings c664db21f8dd:90424:93279 [7] NCCL INFO Connected all rings c664db21f8dd:90424:93279 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 6[b7000] via 
P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 17/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 01/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 00/0 : 
1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 03/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 04/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90424:93279 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 04/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 04/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 05/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO 
Channel 05/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 05/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 05/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 04/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 06/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 06/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 06/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 05/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 07/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 07/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 07/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 06/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 08/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 08/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 08/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 07/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 
[4] NCCL INFO Channel 09/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 09/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 09/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 08/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 10/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 10/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 10/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 09/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 11/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 11/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 11/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 10/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 12/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 12/0 : 5[90000] -> 4[87000] via P2P/IPC/read 
c664db21f8dd:90416:93283 [2] NCCL INFO Channel 12/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 11/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 13/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 13/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 13/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 12/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 14/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 14/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 14/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 13/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 15/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 15/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 15/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 14/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 16/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 5[90000] via 
P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 16/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 16/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 15/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 17/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 17/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 17/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 17/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 16/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 18/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 18/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 18/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 17/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 19/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 19/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 19/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 18/0 : 1[f000] -> 
0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 20/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 20/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 20/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 19/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 21/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 21/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 21/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 20/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 22/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 22/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 22/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 21/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:93285 [4] NCCL INFO Channel 23/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:93281 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:93277 [5] NCCL INFO Channel 23/0 
: 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90416:93283 [2] NCCL INFO Channel 23/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 22/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90415:93287 [1] NCCL INFO Channel 23/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:93275 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93273 [0] NCCL INFO Connected all trees c664db21f8dd:90414:93273 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90414:93273 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90424:93279 [7] NCCL INFO Connected all trees c664db21f8dd:90424:93279 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90424:93279 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90418:93285 [4] NCCL INFO Connected all trees c664db21f8dd:90418:93285 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90418:93285 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90417:93281 [3] NCCL INFO Connected all trees c664db21f8dd:90417:93281 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90417:93281 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90416:93283 [2] NCCL INFO Connected all trees c664db21f8dd:90416:93283 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90416:93283 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90420:93277 [5] NCCL INFO Connected all trees c664db21f8dd:90420:93277 [5] NCCL INFO 
threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90420:93277 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90422:93275 [6] NCCL INFO Connected all trees c664db21f8dd:90422:93275 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90422:93275 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90415:93287 [1] NCCL INFO Connected all trees c664db21f8dd:90415:93287 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90415:93287 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer c664db21f8dd:90422:93275 [6] NCCL INFO comm 0xf859a40 rank 6 nranks 8 cudaDev 6 busId b7000 - Init COMPLETE c664db21f8dd:90414:93273 [0] NCCL INFO comm 0xead2690 rank 0 nranks 8 cudaDev 0 busId 7000 - Init COMPLETE c664db21f8dd:90424:93279 [7] NCCL INFO comm 0xf51f540 rank 7 nranks 8 cudaDev 7 busId bd000 - Init COMPLETE c664db21f8dd:90415:93287 [1] NCCL INFO comm 0xee80ea0 rank 1 nranks 8 cudaDev 1 busId f000 - Init COMPLETE c664db21f8dd:90416:93283 [2] NCCL INFO comm 0xfe1d4f0 rank 2 nranks 8 cudaDev 2 busId 47000 - Init COMPLETE c664db21f8dd:90417:93281 [3] NCCL INFO comm 0x1353a340 rank 3 nranks 8 cudaDev 3 busId 4e000 - Init COMPLETE c664db21f8dd:90420:93277 [5] NCCL INFO comm 0x10deaeb0 rank 5 nranks 8 cudaDev 5 busId 90000 - Init COMPLETE c664db21f8dd:90418:93285 [4] NCCL INFO comm 0xefa9140 rank 4 nranks 8 cudaDev 4 busId 87000 - Init COMPLETE /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. 
You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( /usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was is deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0. You can remove this warning by passing 'storage_options=fs.storage_options' instead. warnings.warn( Parameter 'function'= at 0x7f5bf10f7790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. 
Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%|▎ | 100/39440 [00:01<07:14, 90.56ex/s] Parameter 'function'= at 0x7fc88fb58790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%|▍ | 180/39440 [00:02<07:11, 91.01ex/s] Parameter 'function'= at 0x7f3d416bd790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%| | 31/39440 [00:00<08:05, 81.23ex/s]Parameter 'function'= at 0x7f7ebbcef790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 
0%| | 30/39440 [00:00<08:13, 79.87ex/s]Parameter 'function'= at 0x7ff145c6b790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%| | 40/39440 [00:00<07:48, 84.05ex/s]Parameter 'function'= at 0x7f9af4ac0790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 0%|▎ | 100/39440 [00:01<07:13, 90.71ex/s]Parameter 'function'= at 0x7fcc737ac790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 3%|██▌ | 990/39440 [00:10<07:00, 91.38ex/s]Parameter 'function'= at 0x7f623ca9d790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. 
If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed. 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:15<00:00, 90.61ex/s] 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:14<00:00, 90.67ex/s] 97%|███████████████████████████████████████████████████████████████████████████████████████████████▌ | 38088/39440 [07:02<00:14, 90.67ex/s] *** Starting to generate 8 tokens with bs=5 Generate args {'max_new_tokens': 8, 'do_sample': False, 'min_length': 4, 'pad_token_id': 2} 100%|██████████████████████████████████████████████████████████████████████████████████████████████████▋| 39319/39440 [07:13<00:01, 91.37ex/s] ------------------------------------------------------ Free memory : 51.334167 (GigaBytes) Total memory: 79.346863 (GigaBytes) Requested memory: 4.375000 (GigaBytes) Setting maximum total tokens (input + output) to 1024 ------------------------------------------------------ 99%|██████████████████████████████████████████████████████████████████████████████████████████████████▎| 39162/39440 [07:12<00:03, 90.40ex/s] c664db21f8dd:90414:93817 [0] NCCL INFO Using network Socket c664db21f8dd:90416:93818 [2] NCCL INFO Using network Socket 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:14<00:00, 90.69ex/s] 97%|███████████████████████████████████████████████████████████████████████████████████████████████▊ | 38168/39440 [07:03<00:14, 90.70ex/s] c664db21f8dd:90415:94075 [1] NCCL INFO Using network Socket 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:15<00:00, 90.61ex/s] 
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:15<00:00, 90.65ex/s] 97%|████████████████████████████████████████████████████████████████████████████████████████████████ | 38248/39440 [07:04<00:13, 90.92ex/s] c664db21f8dd:90422:94332 [6] NCCL INFO Using network Socket 97%|████████████████████████████████████████████████████████████████████████████████████████████████ | 38278/39440 [07:04<00:12, 91.01ex/s] c664db21f8dd:90417:94589 [3] NCCL INFO Using network Socket 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:15<00:00, 90.50ex/s] 100%|██████████████████████████████████████████████████████████████████████████████████████████████████▉| 39412/39440 [07:15<00:00, 91.16ex/s] c664db21f8dd:90420:94846 [5] NCCL INFO Using network Socket 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:15<00:00, 90.50ex/s] 97%|████████████████████████████████████████████████████████████████████████████████████████████████▍ | 38408/39440 [07:06<00:11, 90.96ex/s] c664db21f8dd:90424:95103 [7] NCCL INFO Using network Socket 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 39440/39440 [07:17<00:00, 90.09ex/s] 0%| | 0/7888 [00:004->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 c664db21f8dd:90422:94332 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 
00/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90417:94589 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 c664db21f8dd:90424:95103 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 c664db21f8dd:90416:93818 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 c664db21f8dd:90415:94075 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 
2/-1/-1->1->0 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7 c664db21f8dd:90414:93817 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 
1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 c664db21f8dd:90418:95360 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 00/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 00/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 00/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 01/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 01/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via 
P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 03/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 04/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 04/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 04/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 04/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 05/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 05/0 : 5[90000] -> 
6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 05/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 05/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 06/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 04/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 06/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 06/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 06/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 05/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 07/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 05/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 07/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 07/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 07/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 08/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 
06/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 08/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 08/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 08/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 09/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 07/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 09/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 09/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 09/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 10/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 08/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 10/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 10/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 10/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL 
INFO Channel 11/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 09/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 11/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 11/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 11/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 12/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 10/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 12/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 12/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 12/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 13/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 11/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 13/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 13/0 : 2[47000] -> 3[4e000] via P2P/IPC/read 
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 13/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 14/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 12/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 14/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 14/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 14/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 15/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 13/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 15/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 15/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 15/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 16/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 14/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 0[7000] via 
P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 16/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 16/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 16/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 17/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 15/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 17/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 17/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 17/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 17/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 18/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 16/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 18/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 18/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 18/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 19/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 17/0 : 3[4e000] 
-> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 17/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 19/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 19/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 19/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 20/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 18/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 20/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 20/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 20/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 21/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 19/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 21/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 21/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 21/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 
22/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 20/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 22/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 22/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 22/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 23/0 : 4[87000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 21/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 23/0 : 5[90000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 23/0 : 1[f000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 23/0 : 2[47000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 22/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90414:93817 [0] NCCL INFO Channel 23/0 : 0[7000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Connected all rings c664db21f8dd:90422:94332 [6] NCCL INFO Connected all rings c664db21f8dd:90424:95103 [7] NCCL INFO Connected all rings c664db21f8dd:90424:95103 [7] NCCL INFO 
Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Connected all rings c664db21f8dd:90418:95360 [4] NCCL INFO Connected all rings c664db21f8dd:90414:93817 [0] NCCL INFO Connected all rings c664db21f8dd:90417:94589 [3] NCCL INFO Connected all rings c664db21f8dd:90424:95103 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Connected all rings c664db21f8dd:90424:95103 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 17/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 6[b7000] via 
P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 01/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 02/0 : 
3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 04/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 00/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 03/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 05/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 04/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 04/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 06/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 05/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 05/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 05/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 07/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO 
Channel 06/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 06/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 08/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 04/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 07/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 07/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 09/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 05/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 08/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 08/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 10/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 06/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 09/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 09/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 11/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 
[1] NCCL INFO Channel 07/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 10/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 10/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 12/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 08/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 11/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 11/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 13/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 09/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 12/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 12/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 14/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 10/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 13/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 2[47000] via P2P/IPC/read 
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 13/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 15/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 11/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 14/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 14/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 16/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 12/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 15/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 15/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 17/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 13/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 16/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 16/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 18/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 14/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 5[90000] via 
P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 17/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 17/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 17/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 19/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 15/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 18/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 18/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 20/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 16/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 19/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 19/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 21/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 17/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 20/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 20/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 22/0 : 5[90000] 
-> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 18/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 21/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 21/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90420:94846 [5] NCCL INFO Channel 23/0 : 5[90000] -> 4[87000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 19/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90422:94332 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 5[90000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 22/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 22/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 20/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90418:95360 [4] NCCL INFO Channel 23/0 : 4[87000] -> 3[4e000] via P2P/IPC/read c664db21f8dd:90416:93818 [2] NCCL INFO Channel 23/0 : 2[47000] -> 1[f000] via P2P/IPC/read c664db21f8dd:90417:94589 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 2[47000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 21/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 22/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90415:94075 [1] NCCL INFO Channel 23/0 : 1[f000] -> 0[7000] via P2P/IPC/read c664db21f8dd:90424:95103 [7] NCCL INFO Connected all trees c664db21f8dd:90424:95103 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 c664db21f8dd:90424:95103 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer [c664db21f8dd:90424:0:95364] Caught signal 7 (Bus error: nonexistent physical address) 
c664db21f8dd:90414:93817 [0] NCCL INFO Connected all trees
c664db21f8dd:90414:93817 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90414:93817 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90414:0:95363] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 95364) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95363) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
c664db21f8dd:90422:94332 [6] NCCL INFO Connected all trees
c664db21f8dd:90422:94332 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90422:94332 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90422:0:95362] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90420:94846 [5] NCCL INFO Connected all trees
c664db21f8dd:90420:94846 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90420:94846 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90420:0:95366] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90417:94589 [3] NCCL INFO Connected all trees
c664db21f8dd:90417:94589 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90417:94589 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
c664db21f8dd:90418:95360 [4] NCCL INFO Connected all trees
c664db21f8dd:90418:95360 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[c664db21f8dd:90417:0:95367] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90418:95360 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90418:0:95361] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90416:93818 [2] NCCL INFO Connected all trees
c664db21f8dd:90416:93818 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90416:93818 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
c664db21f8dd:90415:94075 [1] NCCL INFO Connected all trees
c664db21f8dd:90415:94075 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[c664db21f8dd:90416:0:95365] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90415:94075 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90415:0:95368] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 95362) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95366) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95361) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95367) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95368) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95365) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[2023-01-12 09:21:49,201] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90414
[2023-01-12 09:21:49,255] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90415
[2023-01-12 09:21:50,108] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90416
[2023-01-12 09:21:50,320] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90417
[2023-01-12 09:21:50,320] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90418
[2023-01-12 09:21:50,773] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90420
[2023-01-12 09:21:50,774] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90422
[2023-01-12 09:21:50,774] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90424
[2023-01-12 09:21:50,774] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'main_crosslingual_deepspeed.py', '--local_rank=7', '--model_name', 'microsoft/bloom-deepspeed-inference-int8', '--dtype', 'int8'] exits with return code = -7
```

I get the same problem if I use another model version, such as bigscience/bloom.

However, when I run `NCCL_DEBUG=INFO deepspeed --num_gpus 4 main_deepspeed.py --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8`, i.e. with the number of GPUs changed to 4 (no matter which 4 of my 8 available GPUs I use), it works perfectly fine.

Any ideas? I'm stuck here, and two months ago everything worked fine.
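
For anyone triaging a log like the one above, here is a minimal sketch that collects which processes reported a fatal signal. It assumes the bracketed prefix on the "Caught signal" lines is `[host:pid:<field>:tid]`; that reading is inferred from the PIDs in the launcher's "Killing subprocess" lines and the tids in the backtrace headers, not from any documented format. The names `CRASH_RE` and `summarize_crashes` are hypothetical.

```python
import re

# Matches lines such as:
# [c664db21f8dd:90414:0:95363] Caught signal 7 (Bus error: nonexistent physical address)
# The field layout [host:pid:<field>:tid] is an assumption inferred from the log.
CRASH_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:\d+:(?P<tid>\d+)\] "
    r"Caught signal (?P<signo>\d+) \((?P<desc>[^)]*)\)"
)


def summarize_crashes(log_text):
    """Return {pid: (signal number, description)} for every crash line found."""
    crashes = {}
    for match in CRASH_RE.finditer(log_text):
        crashes[int(match["pid"])] = (int(match["signo"]), match["desc"])
    return crashes


# Three lines copied from the log above; only the last two are crash reports.
sample = """\
c664db21f8dd:90414:93817 [0] NCCL INFO Connected all trees
[c664db21f8dd:90414:0:95363] Caught signal 7 (Bus error: nonexistent physical address)
[c664db21f8dd:90422:0:95362] Caught signal 7 (Bus error: nonexistent physical address)
"""

crashes = summarize_crashes(sample)
for pid, (signo, desc) in sorted(crashes.items()):
    print(f"pid {pid}: signal {signo} ({desc})")
```

Comparing the resulting PIDs with the "Killing subprocess" lines shows whether every rank crashed or only some of them.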

tjruwase commented 1 year ago

@felifri, thanks for sharing your log. It looks like you are doing inference rather than training. Please open a new issue with an 'inference' tag and the right folks will follow up.

tjruwase commented 1 year ago

> Not directly. `torch.save()` expects a file-like object (has to implement write and flush) or a string or `os.PathLike` object containing a file name.
>
> There are workarounds like:
>
> ```python
> buffer = io.BytesIO()
> torch.save(model, buffer)
> s3.put_object(Bucket="my-bucket/checkpoints", Key=output_model_file, Body=buffer.getvalue())
> ```
>
> but I don't think that would help here.

@BioGeek, thanks for sharing this context. Do you mind opening an 'enhancement' issue for this? One initial thought for providing this support: engine.save_checkpoint() could be extended, similar to torch.save(), so that the path argument could optionally be an object or a function like your workaround above. DeepSpeed could then internally do the right thing. If you are interested, we can iterate on a design in the new issue. Thanks!
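
To make the idea concrete, here is a rough sketch of a save function whose target can be a path, a writable file-like object, or a function that receives the serialized bytes. Everything in it is hypothetical: `save_state` is not a DeepSpeed or PyTorch API, and `pickle.dump` stands in for `torch.save` so that the example runs with the standard library only.

```python
import io
import os
import pickle


def save_state(state, target, serializer=pickle.dump):
    """Write `state` to a path, a writable file-like object, or a callable sink.

    Hypothetical helper for illustration only. `serializer(obj, fileobj)`
    defaults to pickle.dump as a stand-in for torch.save.
    """
    if isinstance(target, (str, os.PathLike)):
        # Plain file name: open it and serialize directly to disk
        with open(target, "wb") as f:
            serializer(state, f)
    elif hasattr(target, "write"):
        # Already a file-like object: serialize straight into it
        serializer(state, target)
    elif callable(target):
        # Serialize in memory, then hand the bytes to the caller's sink
        buffer = io.BytesIO()
        serializer(state, buffer)
        target(buffer.getvalue())
    else:
        raise TypeError(f"unsupported target: {type(target)!r}")


# A dictionary plays the role of remote object storage in this example.
uploaded = {}


def fake_upload(data):
    uploaded["checkpoint.pkl"] = data


save_state({"step": 80000}, fake_upload)
restored = pickle.loads(uploaded["checkpoint.pkl"])
print(restored)
```

In a real setup the sink could wrap an object-store upload such as the `s3.put_object` call in the workaround, and `torch.save` could be passed as `serializer`, since it takes the object and the file in the same order.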

BioGeek commented 1 year ago

> Do you mind opening an 'enhancement' issue request for this?

@tjruwase Done. See #2701

MrRace commented 10 months ago

@tjruwase Following the DeepSpeed-Chat guide, I tried to train models on 8 GPUs. Here is my command:

python3 e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node

Training fails with the same error. Here is the output with NCCL_DEBUG=INFO:

cf0b62962198:378932:378932 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378932:378932 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378932:378932 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378932:378932 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378932:378932 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378932:378932 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
cf0b62962198:378932:380974 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378932:380974 [0] NCCL INFO P2P plugin IBext
cf0b62962198:378932:380974 [0] NCCL INFO NET/IB : No device found.
cf0b62962198:378932:380974 [0] NCCL INFO NET/IB : No device found.
cf0b62962198:378932:380974 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378932:380974 [0] NCCL INFO Using network Socket
cf0b62962198:378940:378940 [6] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378933:378933 [1] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378938:378938 [5] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378934:378934 [2] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378942:378942 [7] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378936:378936 [4] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378935:378935 [3] NCCL INFO cudaDriverVersion 11070
cf0b62962198:378934:378934 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378938:378938 [5] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378935:378935 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378933:378933 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378934:378934 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378938:378938 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378934:378934 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378938:378938 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378934:378934 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378938:378938 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378934:378934 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378938:378938 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378935:378935 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378935:378935 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378935:378935 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378935:378935 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378940:378940 [6] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378933:378933 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378933:378933 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378933:378933 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378933:378933 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378940:378940 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378940:378940 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378940:378940 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378940:378940 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378936:378936 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378936:378936 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378936:378936 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378936:378936 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378936:378936 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378942:378942 [7] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
cf0b62962198:378942:378942 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
cf0b62962198:378942:378942 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
cf0b62962198:378942:378942 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
cf0b62962198:378942:378942 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
cf0b62962198:378934:380975 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378934:380975 [2] NCCL INFO P2P plugin IBext
cf0b62962198:378938:380976 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378938:380976 [5] NCCL INFO P2P plugin IBext
cf0b62962198:378934:380975 [2] NCCL INFO NET/IB : No device found.
cf0b62962198:378938:380976 [5] NCCL INFO NET/IB : No device found.
cf0b62962198:378935:380977 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378935:380977 [3] NCCL INFO P2P plugin IBext
cf0b62962198:378934:380975 [2] NCCL INFO NET/IB : No device found.
cf0b62962198:378938:380976 [5] NCCL INFO NET/IB : No device found.
cf0b62962198:378934:380975 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378935:380977 [3] NCCL INFO NET/IB : No device found.
cf0b62962198:378934:380975 [2] NCCL INFO Using network Socket
cf0b62962198:378938:380976 [5] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378938:380976 [5] NCCL INFO Using network Socket
cf0b62962198:378935:380977 [3] NCCL INFO NET/IB : No device found.
cf0b62962198:378935:380977 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378935:380977 [3] NCCL INFO Using network Socket
cf0b62962198:378933:380978 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378933:380978 [1] NCCL INFO P2P plugin IBext
cf0b62962198:378933:380978 [1] NCCL INFO NET/IB : No device found.
cf0b62962198:378933:380978 [1] NCCL INFO NET/IB : No device found.
cf0b62962198:378933:380978 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378933:380978 [1] NCCL INFO Using network Socket
cf0b62962198:378940:380979 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378940:380979 [6] NCCL INFO P2P plugin IBext
cf0b62962198:378940:380979 [6] NCCL INFO NET/IB : No device found.
cf0b62962198:378940:380979 [6] NCCL INFO NET/IB : No device found.
cf0b62962198:378940:380979 [6] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378940:380979 [6] NCCL INFO Using network Socket
cf0b62962198:378936:380980 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378936:380980 [4] NCCL INFO P2P plugin IBext
cf0b62962198:378936:380980 [4] NCCL INFO NET/IB : No device found.
cf0b62962198:378936:380980 [4] NCCL INFO NET/IB : No device found.
cf0b62962198:378936:380980 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378936:380980 [4] NCCL INFO Using network Socket
cf0b62962198:378942:380981 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
cf0b62962198:378942:380981 [7] NCCL INFO P2P plugin IBext
cf0b62962198:378942:380981 [7] NCCL INFO NET/IB : No device found.
cf0b62962198:378942:380981 [7] NCCL INFO NET/IB : No device found.
cf0b62962198:378942:380981 [7] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
cf0b62962198:378942:380981 [7] NCCL INFO Using network Socket
cf0b62962198:378936:380980 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378940:380979 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378934:380975 [2] NCCL INFO Setting affinity for GPU 2 to 03ffff,ffffffff,ffffffff
cf0b62962198:378938:380976 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378932:380974 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff,ffffffff,ffffffff
cf0b62962198:378942:380981 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378933:380978 [1] NCCL INFO Setting affinity for GPU 1 to 03ffff,ffffffff,ffffffff
cf0b62962198:378935:380977 [3] NCCL INFO Setting affinity for GPU 3 to 03ffff,ffffffff,ffffffff
cf0b62962198:378933:380978 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
cf0b62962198:378932:380974 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378938:380976 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
cf0b62962198:378932:380974 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378942:380981 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
cf0b62962198:378935:380977 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
cf0b62962198:378932:380974 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378936:380980 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
cf0b62962198:378934:380975 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
cf0b62962198:378932:380974 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378940:380979 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
cf0b62962198:378932:380974 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:380974 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
cf0b62962198:378933:380978 [1] NCCL INFO Channel 00/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 00/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 00/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 00/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 00/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 00/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 00/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 00/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 01/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 01/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 01/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 01/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 01/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 01/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 01/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 01/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 02/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 02/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 02/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 02/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 02/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 02/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 02/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 02/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 03/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 03/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 03/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 03/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 03/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 03/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 03/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 03/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 04/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 04/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 04/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 04/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 04/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 04/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 04/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 04/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 05/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 05/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 05/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 05/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 05/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 05/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 05/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 05/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 06/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 06/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 06/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 06/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 06/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 06/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 06/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 06/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 07/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 07/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 07/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 07/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 07/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 07/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 07/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 07/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 08/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 08/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 08/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 08/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 08/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 08/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 08/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 08/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 09/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 09/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 09/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 09/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 09/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 09/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 09/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 09/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 10/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 10/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 10/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 10/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 10/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 10/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 10/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 10/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 11/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 11/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 11/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 11/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 11/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 11/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 11/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 11/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 12/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 12/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 12/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 12/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 12/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 12/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 12/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 12/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 13/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 13/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 13/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 13/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 13/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 13/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 13/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 13/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 14/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 14/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 14/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 14/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 14/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 14/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 14/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 14/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 15/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 15/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 15/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 15/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 15/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 15/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 15/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 15/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 16/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 16/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 16/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 16/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 16/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 16/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 16/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 16/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 17/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 17/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 17/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 17/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 17/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 17/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 17/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 17/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 18/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 18/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 18/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 18/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 18/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 18/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 18/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 18/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 19/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 19/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 19/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 19/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 19/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 19/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 19/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 19/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 20/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 20/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 20/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 20/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 20/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 20/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 20/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 20/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 21/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 21/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 21/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 21/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 21/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 21/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 21/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 21/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 22/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 22/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 22/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 22/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 22/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 22/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 22/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 22/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 23/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 23/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 23/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Channel 23/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 23/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 23/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 23/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 23/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378932:380974 [0] NCCL INFO Connected all rings
cf0b62962198:378942:380981 [7] NCCL INFO Connected all rings
cf0b62962198:378942:380981 [7] NCCL INFO Channel 00/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Connected all rings
cf0b62962198:378938:380976 [5] NCCL INFO Connected all rings
cf0b62962198:378936:380980 [4] NCCL INFO Connected all rings
cf0b62962198:378933:380978 [1] NCCL INFO Connected all rings
cf0b62962198:378934:380975 [2] NCCL INFO Connected all rings
cf0b62962198:378935:380977 [3] NCCL INFO Connected all rings
cf0b62962198:378942:380981 [7] NCCL INFO Channel 01/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 02/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 03/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 04/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 05/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 06/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 07/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 08/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 09/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 10/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 11/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 12/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 13/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 14/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 15/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 16/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 17/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 18/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 19/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 20/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 21/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 22/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Channel 23/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 00/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 00/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 00/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 00/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 00/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 00/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 01/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 01/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 01/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 01/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 01/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 01/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 02/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 02/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 02/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 02/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 02/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 02/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 03/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 03/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 03/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 03/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 03/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 03/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 04/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 04/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 04/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 04/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 04/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 04/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 05/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 05/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 05/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 05/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 05/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 05/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 06/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 06/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 06/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 06/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 06/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 06/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 07/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 07/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 07/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 07/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 07/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 07/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 08/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 08/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 08/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 08/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 08/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 08/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 09/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 09/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 09/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 09/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 09/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 09/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 10/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 10/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 10/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 10/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 10/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 10/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 11/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 11/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 11/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 11/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 11/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 11/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 12/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 12/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 12/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 12/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 12/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 12/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 13/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 13/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 13/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 13/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 13/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 13/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 14/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 14/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 14/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 14/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 14/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 14/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 15/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 15/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 15/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 15/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 15/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 15/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 16/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 16/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 16/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 16/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 16/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 16/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 17/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 17/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 17/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 17/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 17/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 17/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 18/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 18/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 18/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 18/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 18/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 18/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 19/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 19/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 19/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 19/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 19/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 19/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 20/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 20/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 20/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 20/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 20/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 20/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 21/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 21/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 21/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 21/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 21/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 21/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 22/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 22/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 22/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 22/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 22/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 22/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:380979 [6] NCCL INFO Channel 23/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:380978 [1] NCCL INFO Channel 23/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:380980 [4] NCCL INFO Channel 23/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:380977 [3] NCCL INFO Channel 23/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:380976 [5] NCCL INFO Channel 23/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378934:380975 [2] NCCL INFO Channel 23/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:380981 [7] NCCL INFO Connected all trees
cf0b62962198:378942:380981 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378942:380981 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378932:380974 [0] NCCL INFO Connected all trees
cf0b62962198:378932:380974 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378932:380974 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378940:380979 [6] NCCL INFO Connected all trees
cf0b62962198:378940:380979 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378940:380979 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378938:380976 [5] NCCL INFO Connected all trees
cf0b62962198:378938:380976 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378938:380976 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378936:380980 [4] NCCL INFO Connected all trees
cf0b62962198:378936:380980 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378934:380975 [2] NCCL INFO Connected all trees
cf0b62962198:378934:380975 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378933:380978 [1] NCCL INFO Connected all trees
cf0b62962198:378933:380978 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378936:380980 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378933:380978 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378934:380975 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378935:380977 [3] NCCL INFO Connected all trees
cf0b62962198:378935:380977 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378935:380977 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378942:380981 [7] NCCL INFO comm 0x29386170 rank 7 nranks 8 cudaDev 7 busId 41040 - Init COMPLETE
cf0b62962198:378932:380974 [0] NCCL INFO comm 0x2928ef30 rank 0 nranks 8 cudaDev 0 busId b010 - Init COMPLETE
cf0b62962198:378938:380976 [5] NCCL INFO comm 0x2959c8a0 rank 5 nranks 8 cudaDev 5 busId 41020 - Init COMPLETE
cf0b62962198:378940:380979 [6] NCCL INFO comm 0x289ef590 rank 6 nranks 8 cudaDev 6 busId 41030 - Init COMPLETE
cf0b62962198:378936:380980 [4] NCCL INFO comm 0x28de4150 rank 4 nranks 8 cudaDev 4 busId 41010 - Init COMPLETE
cf0b62962198:378933:380978 [1] NCCL INFO comm 0x29a86dc0 rank 1 nranks 8 cudaDev 1 busId b020 - Init COMPLETE
cf0b62962198:378934:380975 [2] NCCL INFO comm 0x29deed20 rank 2 nranks 8 cudaDev 2 busId b030 - Init COMPLETE
cf0b62962198:378935:380977 [3] NCCL INFO comm 0x2a1fde90 rank 3 nranks 8 cudaDev 3 busId b040 - Init COMPLETE
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 1.3544182777404785 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.903181791305542 seconds
Loading extension module fused_adam...
[2024-01-02 07:19:26,345] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-02 07:19:26,345] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
Time to load fused_adam op: 0.20219802856445312 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.40269923210144043 seconds
Time to load fused_adam op: 0.5029723644256592 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.903390645980835 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 1.404487133026123 seconds
Time to load fused_adam op: 0.1026449203491211 seconds
cf0b62962198:378932:382521 [0] NCCL INFO Using network Socket
cf0b62962198:378936:382523 [4] NCCL INFO Using network Socket
cf0b62962198:378933:382522 [1] NCCL INFO Using network Socket
cf0b62962198:378934:382525 [2] NCCL INFO Using network Socket
cf0b62962198:378938:382524 [5] NCCL INFO Using network Socket
cf0b62962198:378940:382526 [6] NCCL INFO Using network Socket
cf0b62962198:378942:382527 [7] NCCL INFO Using network Socket
cf0b62962198:378935:382528 [3] NCCL INFO Using network Socket
cf0b62962198:378936:382523 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378933:382522 [1] NCCL INFO Setting affinity for GPU 1 to 03ffff,ffffffff,ffffffff
cf0b62962198:378934:382525 [2] NCCL INFO Setting affinity for GPU 2 to 03ffff,ffffffff,ffffffff
cf0b62962198:378940:382526 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378935:382528 [3] NCCL INFO Setting affinity for GPU 3 to 03ffff,ffffffff,ffffffff
cf0b62962198:378932:382521 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff,ffffffff,ffffffff
cf0b62962198:378938:382524 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378942:382527 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ffffffff,ffffffff,fffc0000,00000000,00000000
cf0b62962198:378938:382524 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
cf0b62962198:378932:382521 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378940:382526 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
cf0b62962198:378933:382522 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
cf0b62962198:378935:382528 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
cf0b62962198:378936:382523 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
cf0b62962198:378934:382525 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
cf0b62962198:378942:382527 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
cf0b62962198:378932:382521 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
cf0b62962198:378932:382521 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
cf0b62962198:378936:382523 [4] NCCL INFO Channel 00/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 00/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 00/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 00/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 00/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 00/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 00/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 00/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 01/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 01/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 01/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 01/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 01/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 01/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 01/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 01/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 02/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 02/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 02/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 02/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 02/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 02/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 02/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 02/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 03/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 03/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 03/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 03/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 03/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 03/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 03/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 03/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 04/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 04/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 04/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 04/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 04/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 04/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 04/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 04/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 05/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 05/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 05/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 05/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 05/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 05/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 05/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 05/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 06/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 06/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 06/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 06/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 06/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 06/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 06/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 06/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 07/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 07/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 07/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 07/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 07/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 07/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 07/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 07/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 08/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 08/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 08/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 08/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 08/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 08/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 08/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 08/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 09/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 09/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 09/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 09/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 09/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 09/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 09/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 09/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 10/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 10/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 10/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 10/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 10/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 10/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 10/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 10/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 11/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 11/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 11/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 11/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 11/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 11/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 11/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 11/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 12/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 12/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 12/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 12/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 12/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 12/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 12/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 12/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 13/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 13/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 13/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 13/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 13/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 13/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 13/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 13/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 14/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 14/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 14/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 14/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 14/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 14/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 14/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 14/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 15/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 15/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 15/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 15/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 15/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 15/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 15/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 15/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 16/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 16/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 16/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 16/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 16/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 16/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 16/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 16/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 17/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 17/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 17/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 17/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 17/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 17/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 17/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 17/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 18/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 18/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 18/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 18/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 18/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 18/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 18/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 18/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 19/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 19/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 19/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 19/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 19/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 19/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 19/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 19/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 20/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 20/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 20/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 20/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 20/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 20/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 20/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 20/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 21/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 21/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 21/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 21/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 21/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 21/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 21/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 21/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 22/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 22/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 22/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 22/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 22/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 22/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 22/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 22/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 23/0 : 5[41020] -> 6[41030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 23/0 : 4[41010] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 23/0 : 2[b030] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 23/0 : 1[b020] -> 2[b030] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 23/0 : 3[b040] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Channel 23/0 : 0[b010] -> 1[b020] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 23/0 : 6[41030] -> 7[41040] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 23/0 : 7[41040] -> 0[b010] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Connected all rings
cf0b62962198:378934:382525 [2] NCCL INFO Connected all rings
cf0b62962198:378935:382528 [3] NCCL INFO Connected all rings
cf0b62962198:378933:382522 [1] NCCL INFO Connected all rings
cf0b62962198:378938:382524 [5] NCCL INFO Connected all rings
cf0b62962198:378932:382521 [0] NCCL INFO Connected all rings
cf0b62962198:378942:382527 [7] NCCL INFO Connected all rings
cf0b62962198:378942:382527 [7] NCCL INFO Channel 00/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Connected all rings
cf0b62962198:378942:382527 [7] NCCL INFO Channel 01/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 02/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 03/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 04/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 05/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 06/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 07/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 08/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 09/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 10/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 11/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 12/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 13/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 14/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 15/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 16/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 17/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 18/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 19/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 20/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 21/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 22/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378942:382527 [7] NCCL INFO Channel 23/0 : 7[41040] -> 6[41030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 00/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 00/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 00/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 00/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 00/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 00/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 01/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 01/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 01/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 01/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 01/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 01/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 02/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 02/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 02/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 02/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 02/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 02/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 03/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 03/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 03/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 03/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 03/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 03/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 04/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 04/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 04/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 04/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 04/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 04/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 05/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 05/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 05/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 05/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 05/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 05/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 06/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 06/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 06/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 06/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 06/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 06/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 07/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 07/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 07/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 07/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 07/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 07/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 08/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 08/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 08/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 08/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 08/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 08/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 09/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 09/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 09/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 09/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 09/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 09/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 10/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 10/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 10/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 10/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 10/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 10/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 11/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 11/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 11/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 11/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 11/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 11/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 12/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 12/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 12/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 12/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 12/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 12/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 13/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 13/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 13/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 13/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 13/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 13/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 14/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 14/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 14/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 14/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 14/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 14/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 15/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 15/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 15/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 15/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 15/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 16/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 16/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 15/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 16/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 16/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 17/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 16/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 17/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 16/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 17/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 17/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 18/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 18/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 17/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 17/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 18/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 18/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 19/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 19/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 18/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 18/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 19/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 19/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 20/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 20/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 19/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 19/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 20/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 20/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 21/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 21/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 20/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 20/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 21/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 21/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 22/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 22/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 21/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 21/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 22/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 22/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378934:382525 [2] NCCL INFO Channel 23/0 : 2[b030] -> 1[b020] via P2P/IPC/read
cf0b62962198:378936:382523 [4] NCCL INFO Channel 23/0 : 4[41010] -> 3[b040] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 22/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 22/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378933:382522 [1] NCCL INFO Channel 23/0 : 1[b020] -> 0[b010] via P2P/IPC/read
cf0b62962198:378940:382526 [6] NCCL INFO Channel 23/0 : 6[41030] -> 5[41020] via P2P/IPC/read
cf0b62962198:378935:382528 [3] NCCL INFO Channel 23/0 : 3[b040] -> 2[b030] via P2P/IPC/read
cf0b62962198:378938:382524 [5] NCCL INFO Channel 23/0 : 5[41020] -> 4[41010] via P2P/IPC/read
cf0b62962198:378932:382521 [0] NCCL INFO Connected all trees
cf0b62962198:378932:382521 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378932:382521 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378932:0:382581] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378942:382527 [7] NCCL INFO Connected all trees
cf0b62962198:378942:382527 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378942:382527 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378942:0:382584] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 382581) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bbc0 __nss_database_lookup()  ???:0
 2 0x000000000006f425 ncclShmOpen()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/misc/shmutils.cc:52
 3 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:897
 4 0x0000000000065b8f proxyConnInit()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:963
 5 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:1105
 6 0x0000000000008609 start_thread()  ???:0
 7 0x000000000011f133 clone()  ???:0
=================================
==== backtrace (tid: 382584) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bbc0 __nss_database_lookup()  ???:0
 2 0x000000000006f425 ncclShmOpen()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/misc/shmutils.cc:52
 3 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:897
 4 0x0000000000065b8f proxyConnInit()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:963
 5 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:1105
 6 0x0000000000008609 start_thread()  ???:0
 7 0x000000000011f133 clone()  ???:0
=================================
cf0b62962198:378933:382522 [1] NCCL INFO Connected all trees
cf0b62962198:378933:382522 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378933:382522 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378933:0:382578] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378934:382525 [2] NCCL INFO Connected all trees
cf0b62962198:378934:382525 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378934:382525 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378934:0:382579] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378935:382528 [3] NCCL INFO Connected all trees
cf0b62962198:378935:382528 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378938:382524 [5] NCCL INFO Connected all trees
cf0b62962198:378938:382524 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378938:382524 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378938:0:382583] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378935:382528 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
cf0b62962198:378936:382523 [4] NCCL INFO Connected all trees
cf0b62962198:378936:382523 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[cf0b62962198:378935:0:382582] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378936:382523 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378936:0:382577] Caught signal 7 (Bus error: nonexistent physical address)
cf0b62962198:378940:382526 [6] NCCL INFO Connected all trees
cf0b62962198:378940:382526 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
cf0b62962198:378940:382526 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[cf0b62962198:378940:0:382580] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 382578) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bbc0 __nss_database_lookup()  ???:0
 2 0x000000000006f425 ncclShmOpen()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/misc/shmutils.cc:52
 3 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:897
 4 0x0000000000065b8f proxyConnInit()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:963
 5 0x0000000000065b8f ncclProxyService()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/proxy.cc:1105
 6 0x0000000000008609 start_thread()  ???:0
 7 0x000000000011f133 clone()  ???:0
=================================
==== backtrace (tid: 382579) ====