Error message:
Traceback (most recent call last):
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/swift/cli/sft.py", line 5, in <module>
sft_main()
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
result = llm_x(args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/swift/llm/sft.py", line 414, in llm_sft
trainer.train(training_args.resume_from_checkpoint)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/swift/trainers/mixin.py", line 409, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 1991, in train
return inner_training_loop(
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 2332, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/trainer.py", line 3424, in training_step
loss = self.compute_loss(model, inputs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/swift/trainers/trainers.py", line 170, in compute_loss
outputs = model(**inputs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
result = forward_call(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 819, in forward
return model_forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/utils/operations.py", line 807, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/peft/peft_model.py", line 1430, in forward
return self.base_model(
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
return self.model.forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1581, in forward
image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw).to(inputs_embeds.device)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1018, in forward
hidden_states = self.patch_embed(hidden_states)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 243, in forward
hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 608, in forward
return self._conv_forward(input, self.weight, self.bias)
File "~/miniconda3/envs/swift/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 603, in _conv_forward
return F.conv3d(
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
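Since CUDA errors can be reported asynchronously, the frame shown above (F.conv3d) may not be the exact failing call. The message itself suggests CUDA_LAUNCH_BLOCKING=1; a minimal sketch of how a rerun could be prepared, assuming the same shell session is then used to launch the unchanged swift sft command from this report:

```shell
# Make CUDA kernel launches synchronous so the Python traceback
# points at the call that actually fails (this slows training down).
export CUDA_LAUNCH_BLOCKING=1

# Same GPU selection as in the original fine-tuning command.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Confirm the settings before rerunning the unchanged `swift sft` command.
echo "CUDA_LAUNCH_BLOCKING=${CUDA_LAUNCH_BLOCKING}"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```

This only changes where the error is reported; it is not expected to make the error go away.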
Your hardware and system info
torch==2.4
transformers==4.45.dev0
torchvision==0.19.0
4*V100
NVIDIA-SMI 535.154.05
Driver Version: 535.154.05
CUDA Version: 12.2
Describe the bug
Fine-tuning command:
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path ../model/Qwen2-VL-7B-Instruct/qwen/Qwen2-VL-7B-Instruct/ \
  --dataset ./data/json/internvl_train_cn.json \
  --output_dir ./output/qwen2_vl_7b_instruct/model \
  --max_length 2048 \
  --train_dataset_sample -1 \
  --num_train_epochs 10 \
  --check_dataset_strategy warning \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_dropout_p 0.05 \
  --lora_target_modules DEFAULT \
  --gradient_checkpointing true \
  --batch_size 1 \
  --learning_rate 1e-4 \
  --gradient_accumulation_steps 1 \
  --max_grad_norm 0.5 \
  --eval_steps 1000 \
  --save_strategy epoch \
  --save_total_limit -1 \
  --logging_steps 1000 \
  --use_flash_attn false \
  --lora_target_modules ALL \
  --weight_decay 0.001 \
  --warmup_ratio 0.05
Additional context
I am using 4 V100 GPUs. If I add --dtype fp32 or --dtype fp16, both runs fail with a GPU memory overflow error.