haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Multi-GPU training hangs: Watchdog caught collective operation timeout #447

Closed: 24-solar-terms closed this issue 5 months ago

24-solar-terms commented 1 year ago

Describe the issue

Hi, when I train on my own dataset (roughly 500k samples) with DDP on 8 A100 80G GPUs, the training hangs and gives the following error:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802713 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803216 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802791 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802786 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15172, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803288 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802877 milliseconds before timing out.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [67,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [68,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [69,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
Traceback (most recent call last):
  File "/workdir/llava/train/train_mem.py", line 16, in <module>
    train()
  File "/workdir/llava/train/train.py", line 930, in train
    trainer.train()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/workdir/llava/model/language_model/llava_llama.py", line 75, in forward
    input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
  File "/workdir/llava/model/llava_arch.py", line 119, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "/workdir/llava/model/llava_arch.py", line 99, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workdir/llava/model/multimodal_encoder/donut_encoder.py", line 47, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workdir/llava/model/multimodal_encoder/donut.py", line 107, in forward
    x = self.model.layers(x)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 420, in forward
    x = self.blocks(x)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 310, in forward
    attn_windows = self.attn(x_windows, mask=self.attn_mask)  # num_win*B, window_size*window_size, C
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 216, in forward
    x = (attn @ v).transpose(1, 2).reshape(B_, N, -1)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/workdir/llava/train/train_mem.py", line 16, in <module>
    train()
  File "/workdir/llava/train/train.py", line 930, in train
    trainer.train()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1853, in backward
    loss.backward(**kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: NCCL communicator was aborted on rank 6.  Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
=====================================================

At first I suspected that corrupt images were causing the error, since the message above shows a CUDA index error and the traceback points into the Swin Transformer. I checked every image by opening it with PIL's Image.open and deleted every image that raised a warning, but found nothing wrong, and the training still got stuck. I also checked the input image tensor sizes, and they are correct. I searched the community for suggestions and tried the following environment variables (applied roughly as in the sketch after the list):

CUDA_LAUNCH_BLOCKING=1
NCCL_P2P_LEVEL=2
NCCL_P2P_DISABLE=1
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_IB_TIMEOUT=22
NCCL_BLOCKING_WAIT=0
unset LD_LIBRARY_PATH
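
This is roughly how I applied them; it is only a sketch (in practice the variables were exported in the shell before launching torchrun, and the explicit init_process_group call with a longer watchdog timeout below is just an extra debugging idea, not part of my original setup, since the HF Trainer normally creates the process group itself):

# Rough sketch only: values must be in the environment before CUDA/NCCL
# initialize, and the init_process_group call here is an assumption/idea.
import os
from datetime import timedelta

import torch.distributed as dist

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"      # surface CUDA errors at the failing op
os.environ["NCCL_P2P_DISABLE"] = "1"          # rule out peer-to-peer transport issues
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"

# Raising the timeout does not fix a hang, but it helps distinguish a slow
# rank from a rank that has crashed and left the others waiting on a collective.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=3))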

None of these settings worked. I then tried training with 2 GPUs and a per-device batch size of 1, printing each image path to find the sample where training gets stuck, but that data turned out to be fine: when I built a dataset containing only those 2 images, training ran without hanging.

However, training on a single GPU works fine, and DDP training with other datasets also works fine. So the code seems to be OK, which suggests a problem in the dataset; but since single-GPU training works and this dataset was previously used to train another model, the dataset also seems fine.
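
For completeness, the image check I mention above was roughly the following (a minimal sketch; the annotation-file layout and the 'image' field name are placeholders for my own data format):

# Sketch of the image sanity check: open every image with PIL, treat warnings
# as errors, and record anything that fails to decode.
import json
import warnings

from PIL import Image

def find_bad_images(annotation_file, image_folder):
    bad = []
    with open(annotation_file) as f:
        samples = json.load(f)
    for sample in samples:
        path = f"{image_folder}/{sample['image']}"
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("error")   # promote PIL warnings to errors
                with Image.open(path) as img:
                    img.convert("RGB").load()    # force a full decode
        except Exception as err:
            bad.append((path, repr(err)))
    return bad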

I also tried adding the following line at the beginning of train.py:

torch.distributed.init_process_group(backend="gloo")

but I just get this error message:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3772 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3773 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3774 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3775 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3776 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3779 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 5 (pid: 3777) of binary: /miniconda/envs/llava/bin/python3
Traceback (most recent call last):
  File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
llava/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-21_12:47:26
  host      : psx1kopxqb355ls7-worker-0
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 3777)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3777
=====================================================

I'm quite confused and don't know what to try next.

haotian-liu commented 1 year ago

Hi, please try with the latest DeepSpeed commands, thanks.

1359347500cwc commented 11 months ago

Have you solved this problem? I encountered the same issue. In my case it happened after 2444 steps of training. The dataset contains about 540k samples of custom data.

I am fine-tuning with 4×4 A100-40G GPUs (4 nodes, 4 GPUs per node) using this script:

python -m torch.distributed.run --nnodes=4 \
    --node_rank=$RANK \
    --nproc_per_node=4 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    ${MODLE_DIR}/train_mem.py \
    --deepspeed ${MODLE_DIR}/scripts/zero3.json \
    --model_name_or_path /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/ckpt/llava-v1.5-7b \
    --version v1 \
    --data_path /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/data_filter/mix_540k_ocr_translate_new.json \
    --image_folder ${IMAGE_DIR} \
    --vision_tower /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/ckpt/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /mnt/chongqinggeminiceph1fs/geminicephfs/pr-training-mt/cwctchen/cwctchen/LLava_workspace/checkpoints/checkpoints/llava-v1.5-7b-ocr_translate_task_44card_new \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none

[Screenshot: 2023-11-22 18:35:25]

pkulwj1994 commented 6 months ago

I ran into the same issue and solved it by updating DeepSpeed to the latest version:

pip install -U deepspeed

Thanks to @haotian-liu for the constructive suggestion!

Echo0125 commented 5 months ago

I encountered the same issue on my own data. With a batch size of 16 there are no problems, but with a batch size of 8 and gradient_accumulation_steps set to 2, it hangs. It also hangs when I add new data. I tried updating torch, deepspeed, and accelerate, but that did not resolve the issue.

24-solar-terms commented 5 months ago

@Echo0125 Maybe check the input sequence length after the image placeholder token is replaced by the real image embeddings.

24-solar-terms commented 5 months ago

My problem was solved by checking the input sequence length after the image placeholder token is replaced by the real image embeddings. In my dataset, some prompts are long enough that the total input length exceeds the max sequence length once the placeholder is expanded into image embeddings. That is what produces the "../aten/src/ATen/native/cuda/IndexKernel.cu:92 ... Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed." error: the rank that hits the bad index crashes, and the remaining ranks wait on the collective until the NCCL watchdog times out.
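
A minimal sketch of the kind of length check I mean is below. The constants and helper names are assumptions on my side (576 patch tokens corresponds to CLIP-ViT-L/14-336, and the estimate is approximate because the placeholder itself may tokenize into more than one id); in LLaVA the actual expansion happens inside prepare_inputs_labels_for_multimodal.

# Approximate length check: estimate the sequence length *after* each <image>
# placeholder is expanded into patch embeddings, and flag samples that would
# overflow model_max_length. NUM_IMAGE_PATCHES and the names are assumptions.
IMAGE_TOKEN = "<image>"
NUM_IMAGE_PATCHES = 576   # e.g. CLIP-ViT-L/14-336; adjust for your vision tower

def exceeds_max_len(conversation_text, tokenizer, max_len=2048):
    n_text = len(tokenizer(conversation_text).input_ids)
    n_images = conversation_text.count(IMAGE_TOKEN)
    # Each placeholder (roughly one token in n_text) becomes NUM_IMAGE_PATCHES
    # embeddings after expansion, so the effective length grows by
    # NUM_IMAGE_PATCHES - 1 per image.
    effective_len = n_text + n_images * (NUM_IMAGE_PATCHES - 1)
    return effective_len > max_len

# Usage idea (build_prompt is a hypothetical helper for your own data format):
#   bad = [s for s in samples if exceeds_max_len(build_prompt(s), tokenizer)]
# Drop or truncate these samples before training; otherwise one rank hits the
# IndexKernel assertion while the others sit in the collective until the
# watchdog fires.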

Andrew-Zhang commented 4 months ago

I encountered the same issue on my own data. With a batch size of 16 there are no problems, but with a batch size of 8 and gradient_accumulation_steps set to 2, it hangs. It also hangs when I add new data. I tried updating torch, deepspeed, and accelerate, but that did not resolve the issue.

@Echo0125 Did you manage to solve this issue? I get the same problem where gradient accumulation leads to an error.

zmtttt commented 2 months ago

Have you solved this? I'm hitting the same problem.