microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ZeRO stage 2/3 training NCCL error for deepspeed version > 0.9.5 #5022

Open KevinPanJun opened 7 months ago

KevinPanJun commented 7 months ago

While using the DeepSpeed package, the following configuration works well with deepspeed==0.9.5:

  {
    "zero_optimization": {
      "stage": 2,
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto"
    },
    "fp16": {
      "enabled": true,
      "auto_cast": false,
      "loss_scale": 0,
      "initial_scale_power": 16,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1e-100
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "wall_clock_breakdown": false
  }
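For reference, this config is consumed through the Hugging Face Trainer's DeepSpeed integration, which fills in the "auto" values from the training arguments. A minimal sketch of the wiring (the file name, model, and dataset below are placeholders, not my actual training script):

```python
from transformers import Trainer, TrainingArguments

# Placeholder path: the JSON config shown above, saved to disk.
training_args = TrainingArguments(
    output_dir="./output",
    fp16=True,                         # matches "fp16.enabled": true above
    deepspeed="ds_zero2_config.json",  # hypothetical file name
)

trainer = Trainer(
    model=my_model,                    # placeholder: the model being fine-tuned
    args=training_args,
    train_dataset=my_train_dataset,    # placeholder: the training dataset
)
trainer.train()
```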

But for DeepSpeed versions > 0.9.5 (I tested 0.12.6 and 0.13.1), an NCCL error occurs:

  RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.17.1
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
      return inner_training_loop(
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
      self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 2242, in _maybe_log_save_evaluate
      tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 3345, in _nested_gather
      tensors = distributed_concat(tensors)
    File ".local/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 204, in distributed_concat
      dist.all_gather(output_tensors, tensor)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2275, in all_gather
      work = default_pg.allgather([tensor_list], [tensor])
  RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.17.1
  ncclInternalError: Internal check failed.

Is there a breaking change in later versions of the DeepSpeed package?

awan-10 commented 7 months ago

I am not aware of any changes that should cause an error like this. Can you share a small reproducer? Please also share the transformers version. NCCL errors are hard to reproduce without the specific platform details, so please also share the PyTorch version and the machine details you are seeing this error on.
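For example, something along these lines run on one of the failing ranks would capture most of the versions we need (DeepSpeed's `ds_report` command gives the fuller environment report):

```python
# Minimal environment snapshot for the bug report; run inside the failing environment.
import torch
import transformers
import deepspeed

print("deepspeed   :", deepspeed.__version__)
print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda)
print("nccl        :", torch.cuda.nccl.version())
```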

GuanhuaWang commented 7 months ago

Also @KevinPanJun, could you provide the hardware setup? What kind of GPU are you using, how many GPUs per node, and how many nodes?

KevinPanJun commented 7 months ago

@GuanhuaWang and @awan-10, I was using 4 nodes with 8 GPUs per node (NVIDIA V100 with 32 GB memory). Packages:

  torch           1.13.1
  torch-nebula    0.16.5
  torch-ort       1.15.0
  torchaudio      0.13.1+cu117
  torchmetrics    0.11.3
  torchsnapshot   0.1.0
  torchvision     0.14.1+cu117
  transformers    4.36.2

I checked the job: it failed with a CUDA OOM inside NCCL (GPU 3 on node-0):

  node-0:2874:7128 [3] NCCL INFO === System : maxBw 1.2 totalBw 132.0 ===
  node-0:2874:7128 [3] NCCL INFO CPU/0 (1/1/2)
  node-0:2874:7128 [3] NCCL INFO + PCI[5000.0] - NIC/0
  node-0:2874:7128 [3] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/100000 (0)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/400000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/200000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/300000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/700000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/200000 (1)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/100000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/300000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/800000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/400000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/300000 (2)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/500000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/200000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/400000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/100000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/400000 (3)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/100000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/600000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/200000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/300000
  node-0:2874:7128 [3] NCCL INFO + SYS[9.0] - CPU/1
  node-0:2874:7128 [3] NCCL INFO CPU/1 (1/1/2)
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/500000 (4)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/300000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/800000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/600000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/700000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/600000 (5)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/700000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/400000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/800000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/500000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/700000 (6)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/600000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/800000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/100000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/500000
  node-0:2874:7128 [3] NCCL INFO + PCI[12.0] - GPU/800000 (7)
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/700000
  node-0:2874:7128 [3] NCCL INFO + NVL[44.0] - GPU/500000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/600000
  node-0:2874:7128 [3] NCCL INFO + NVL[22.0] - GPU/200000
  node-0:2874:7128 [3] NCCL INFO + SYS[9.0] - CPU/0
  node-0:2874:7128 [3] NCCL INFO ==========================================
  node-0:2874:7128 [3] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (1/44.000000/NVL) GPU/300000 (1/22.000000/NVL) GPU/400000 (1/44.000000/NVL) GPU/500000 (2/22.000000/NVB) GPU/600000 (2/44.000000/NVB) GPU/700000 (1/22.000000/NVL) GPU/800000 (2/22.000000/NVB) CPU/0 (1/12.000000/PHB) CPU/1 (2/12.000000/PHB) NET/0 (3/1.250000/PHB)
  node-0:2874:7128 [3] NCCL INFO GPU/200000 :GPU/100000 (1/44.000000/NVL) GPU/200000 (0/5000.000000/LOC) GPU/300000 (1/44.000000/NVL) GPU/400000 (1/22.000000/NVL) GPU/500000 (2/44.000000/NVB) GPU/600000 (2/22.000000/NVB) GPU/700000 (2/22.000000/NVB) GPU/800000 (1/22.000000/NVL) CPU/0 (1/12.000000/PHB) CPU/1 (2/12.000000/PHB) NET/0 (3/1.250000/PHB)
  node-0:2874:7128 [3] NCCL INFO GPU/300000 :GPU/100000 (1/22.000000/NVL) GPU/200000 (1/44.000000/NVL) GPU/300000 (0/5000.000000/LOC) GPU/400000 (1/22.000000/NVL) GPU/500000 (1/44.000000/NVL) GPU/600000 (2/22.000000/NVB) GPU/700000 (2/22.000000/NVB) GPU/800000 (2/44.000000/NVB) CPU/0 (1/12.000000/PHB) CPU/1 (2/12.000000/PHB) NET/0 (3/1.250000/PHB)
  node-0:2874:7128 [3] NCCL INFO GPU/400000 :GPU/100000 (1/44.000000/NVL) GPU/200000 (1/22.000000/NVL) GPU/300000 (1/22.000000/NVL) GPU/400000 (0/5000.000000/LOC) GPU/500000 (2/22.000000/NVB) GPU/600000 (1/44.000000/NVL) GPU/700000 (2/44.000000/NVB) GPU/800000 (2/22.000000/NVB) CPU/0 (1/12.000000/PHB) CPU/1 (2/12.000000/PHB) NET/0 (3/1.250000/PHB)
  node-0:2874:7128 [3] NCCL INFO GPU/500000 :GPU/100000 (2/22.000000/NVB) GPU/200000 (2/44.000000/NVB) GPU/300000 (1/44.000000/NVL) GPU/400000 (2/22.000000/NVB) GPU/500000 (0/5000.000000/LOC) GPU/600000 (1/22.000000/NVL) GPU/700000 (1/22.000000/NVL) GPU/800000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS)
  node-0:2874:7128 [3] NCCL INFO GPU/600000 :GPU/100000 (2/44.000000/NVB) GPU/200000 (2/22.000000/NVB) GPU/300000 (2/22.000000/NVB) GPU/400000 (1/44.000000/NVL) GPU/500000 (1/22.000000/NVL) GPU/600000 (0/5000.000000/LOC) GPU/700000 (1/44.000000/NVL) GPU/800000 (1/22.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS)
  node-0:2874:7128 [3] NCCL INFO GPU/700000 :GPU/100000 (1/22.000000/NVL) GPU/200000 (2/22.000000/NVB) GPU/300000 (2/22.000000/NVB) GPU/400000 (2/44.000000/NVB) GPU/500000 (1/22.000000/NVL) GPU/600000 (1/44.000000/NVL) GPU/700000 (0/5000.000000/LOC) GPU/800000 (1/44.000000/NVL) CPU/0 (2/12.000000/PHB) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS)
  node-0:2874:7128 [3] NCCL INFO GPU/800000 :GPU/100000 (2/22.000000/NVB) GPU/200000 (1/22.000000/NVL) GPU/300000 (2/44.000000/NVB) GPU/400000 (2/22.000000/NVB) GPU/500000 (1/44.000000/NVL) GPU/600000 (1/22.000000/NVL) GPU/700000 (1/44.000000/NVL) GPU/800000 (0/5000.000000/LOC) CPU/0 (2/12.000000/PHB) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS)
  node-0:2874:7128 [3] NCCL INFO NET/0 :GPU/100000 (3/1.250000/PHB) GPU/200000 (3/1.250000/PHB) GPU/300000 (3/1.250000/PHB) GPU/400000 (3/1.250000/PHB) GPU/500000 (4/1.250000/SYS) GPU/600000 (4/1.250000/SYS) GPU/700000 (4/1.250000/SYS) GPU/800000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC)
  node-0:2874:7128 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff
  node-0:2874:7128 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type NVL/PHB, sameChannels 1
  node-0:2874:7128 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/4 GPU/7 GPU/6 GPU/5 GPU/3 NET/0
  node-0:2874:7128 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 2.400000/1.200000, type NVL/PHB, sameChannels 1
  node-0:2874:7128 [3] NCCL INFO 0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/4 GPU/7 GPU/6 GPU/5 GPU/3 NET/0
  node-0:2874:7128 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
  node-0:2874:7128 [3] NCCL INFO Ring 00 : 5 -> 3 -> 8
  node-0:2874:7128 [3] NCCL INFO Ring 01 : 5 -> 3 -> 8
  node-0:2874:7128 [3] NCCL INFO Trees [0] -1/-1/-1->3->5 [1] -1/-1/-1->3->5
  node-0:2874:7128 [3] NCCL INFO P2P Chunksize set to 131072
  node-0:2874:7128 [3] NCCL INFO Rank 3 selecting transport for rank 8
  node-0:2874:7128 [3] NCCL INFO Transport 0 canConnect 0
  node-0:2874:7128 [3] NCCL INFO Transport 1 canConnect 0
  node-0:2874:7128 [3] NCCL INFO Transport 2 canConnect 1
  node-0:2874:7128 [3] NCCL INFO Channel 00/0 : 3[400000] -> 8[100000] [send] via NET/Socket/0
  node-0:2874:7128 [3] NCCL INFO Rank 3 selecting transport for rank 8
  node-0:2874:7128 [3] NCCL INFO Transport 0 canConnect 0
  node-0:2874:7128 [3] NCCL INFO Transport 1 canConnect 0
  node-0:2874:7128 [3] NCCL INFO Transport 2 canConnect 1
  node-0:2874:7128 [3] NCCL INFO Channel 01/0 : 3[400000] -> 8[100000] [send] via NET/Socket/0
  node-0:2874:7128 [3] NCCL INFO Rank 3 selecting transport for rank 5
  node-0:2874:7128 [3] NCCL INFO Transport 0 canConnect 1

  node-0:2874:7145 [3] include/alloc.h:99 NCCL WARN Cuda failure 'out of memory'

  node-0:2874:7145 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 10485760 bytes
  node-0:2874:7145 [3] NCCL INFO transport/p2p.cc:449 -> 1
  node-0:2874:7145 [3] NCCL INFO proxy.cc:1299 -> 1
  node-0:2874:7145 [3] NCCL INFO proxy.cc:1359 -> 1

  node-0:2874:7145 [3] proxy.cc:1494 NCCL WARN [Proxy Service 3] Failed to execute operation Setup from rank 3, retcode 1

  node-0:2874:7128 [3] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer node-0<54603>
  node-0:2874:7128 [3] NCCL INFO misc/socket.cc:746 -> 6

  node-0:2874:7128 [3] proxy.cc:1107 NCCL WARN Socket recv failed while polling for opId=0x7f2a781e1460
  node-0:2874:7128 [3] NCCL INFO transport/p2p.cc:302 -> 3
  node-0:2874:7128 [3] NCCL INFO transport.cc:35 -> 3
  node-0:2874:7128 [3] NCCL INFO transport.cc:102 -> 3
  node-0:2874:7128 [3] NCCL INFO init.cc:915 -> 3
  node-0:2874:7128 [3] NCCL INFO init.cc:1133 -> 3
  node-0:2874:7128 [3] NCCL INFO group.cc:67 -> 3 [Async thread]
  node-0:2874:2874 [3] NCCL INFO group.cc:437 -> 3
  node-0:2874:2874 [3] NCCL INFO group.cc:117 -> 3
  node-0:2874:2874 [3] NCCL INFO MSCCL: Teardown finished
  node-0:2874:2874 [3] NCCL INFO comm 0x90b0ecc0 rank 3 nranks 32 cudaDev 3 busId 400000 - Abort COMPLETE
  Traceback (most recent call last):
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
      return inner_training_loop(
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
      self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 2242, in _maybe_log_save_evaluate
      tr_loss_scalar = self._nested_gather(tr_loss).mean().item()
    File ".local/lib/python3.8/site-packages/transformers/trainer.py", line 3345, in _nested_gather
      tensors = distributed_concat(tensors)
    File ".local/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 204, in distributed_concat
      dist.all_gather(output_tensors, tensor)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2275, in all_gather
      work = default_pg.allgather([tensor_list], [tensor])
  RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.17.1
  ncclInternalError: Internal check failed.
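
Since the immediate failure is a roughly 10 MB CUDA calloc inside NCCL's P2P/proxy setup, one way to check whether the GPUs are simply out of memory headroom (a diagnostic sketch only, not a confirmed root cause) is to log free device memory on every rank shortly before the crashing all_gather, e.g. from a TrainerCallback hook:

```python
# Diagnostic sketch: log per-rank free GPU memory to see whether NCCL's internal
# allocations are competing with model/optimizer state for the last few MiB.
import torch
import torch.distributed as dist


def log_gpu_headroom(tag: str = "") -> None:
    # (free, total) device memory in bytes for the current rank's GPU
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {tag} free={free_bytes / 2**20:.0f} MiB "
          f"of {total_bytes / 2**20:.0f} MiB", flush=True)


# Example: call log_gpu_headroom("before gather") right before trainer.train(),
# or from a transformers.TrainerCallback on_log hook to catch the state just
# before _maybe_log_save_evaluate triggers the failing all_gather.
```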