haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

Encounter a “CUDA error: device-side assert triggered” when fine-tuning llama2 on V100 #445

Open kanxueli opened 1 year ago

kanxueli commented 1 year ago

Describe the issue

Issue: When I run finetune_qlora.sh on a V100 to fine-tune llama2, I get a CUDA error. Do you know how to solve it? Thanks a lot. @haotian-liu

Here is the configuration of finetune_qlora.sh:
deepspeed llava/train/train.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --bits 4 \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path $DATASET_PATH/llava_instruct_80k.json \
    --image_folder $DATASET_PATH/coco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter $MODEL_CHECKPOINTS/llava-pretrain-$MODEL_VERSION/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 False \
    --fp16 True \
    --output_dir ./checkpoints/llava-$MODEL_VERSION-finetune_lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
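
For reference on the precision flags: the V100 (compute capability 7.0) has no bf16 or tf32 support, so --bf16 False, --tf32 False and --fp16 True are the settings that match this GPU. A quick, generic check (a sketch, not part of finetune_qlora.sh; it only assumes PyTorch with CUDA is installed):

    # Generic hardware check, not LLaVA-specific: confirms whether this GPU
    # supports bf16 at all before blaming the training script.
    import torch

    print(torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())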

Log:

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [47,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    train()
  File "/home/likanxue1/model_inference/LLaVA/llava/train/train.py", line 910, in train
    trainer.train()
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/likanxue1/.custom/cuda11.3.1-cudnn8-devel-ubuntu20.04-py38-jupyter/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/likanxue1/model_inference/LLaVA/llava/model/language_model/llava_llama.py", line 75, in forward
    input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
  File "/home/likanxue1/model_inference/LLaVA/llava/model/llava_arch.py", line 140, in prepare_inputs_labels_for_multimodal
    cur_new_labels.append(cur_labels[:image_token_start])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
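
A note on the assert itself: the `srcIndex < srcSelectDimSize` assertion in indexSelectLargeIndex usually means an embedding / index_select lookup received an out-of-range index, for example a token id that is >= the model's embedding table size (a special image token added to the tokenizer but not reflected in the embedding matrix would do this). Running with CUDA_LAUNCH_BLOCKING=1 (as the log suggests) or repeating the tokenization on CPU gives a readable error instead of the device-side assert. A minimal sketch of that check, assuming standard transformers APIs and a placeholder checkpoint path:

    # Hypothetical diagnostic sketch (not from the LLaVA repo).
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_path = "./checkpoints/MODEL_VERSION"  # placeholder; point at your local checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)  # CPU is enough for this check

    embedding_rows = model.get_input_embeddings().weight.shape[0]
    print("tokenizer size:", len(tokenizer), "embedding rows:", embedding_rows)

    # Any token id >= embedding_rows will trigger exactly this kind of CUDA
    # assert once it reaches the embedding lookup.
    sample = tokenizer("a short test prompt", return_tensors="pt")
    out_of_range = sample["input_ids"] >= embedding_rows
    print("out-of-range token ids:", sample["input_ids"][out_of_range].tolist())

If the tokenizer turns out to be larger than the embedding matrix, a common fix is calling model.resize_token_embeddings(len(tokenizer)) after adding the special tokens, but whether that applies here depends on how the checkpoint was produced.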
kanxueli commented 1 year ago

I find that the error above no longer occurs, but the loss is always 0, when I change the value of 'mm_use_im_patch_token' from False to True. I noticed that you asked me to make sure 'mm_use_im_patch_token' is set to False when using these projector weights to instruction-tune my LMM. I tried setting it to True anyway because I wanted the program to run, and I would like to know why the training loss becomes 0 when these flags are set to True during fine-tuning. @haotian-liu
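
One guess at why the loss becomes 0: if every target position in a batch ends up masked to IGNORE_INDEX (-100), the cross-entropy loss has nothing to supervise and comes out as 0. A tiny sanity check along those lines (a sketch only; count_supervised_tokens is a hypothetical helper, not a repo function):

    # Hypothetical sanity check: a fully masked label tensor yields zero loss.
    import torch

    IGNORE_INDEX = -100  # the usual Hugging Face / LLaVA masking value

    def count_supervised_tokens(labels: torch.Tensor) -> int:
        # How many label positions actually contribute to the loss.
        return int((labels != IGNORE_INDEX).sum().item())

    # A fully masked batch -> 0 supervised tokens -> the loss would also be 0.
    labels = torch.full((2, 8), IGNORE_INDEX)
    print(count_supervised_tokens(labels))  # prints 0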

Chen-Song commented 7 months ago

I also encounter this "CUDA error: device-side assert triggered" when fine-tuning llama2 on a V100. How can it be solved?