modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 300+ LLMs or 40+ MLLMs. (Qwen2, GLM4, Internlm2.5, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0

Does resuming training from a checkpoint require more GPU memory? #570

Open kratorado opened 3 months ago

kratorado commented 3 months ago

When fine-tuning qwen1.5-14B with swift, the initial run works fine, but after resuming training from a checkpoint it fails. The error message is as follows:

[INFO:swift] Setting model.config.use_cache: False
[WARNING:modelscope] Reusing dataset dataset_builder (/home/devops/.cache/modelscope/hub/datasets/AI-ModelScope/hh_rlhf_cn/master/data_files)
[INFO:modelscope] Generating dataset dataset_builder (/home/devops/.cache/modelscope/hub/datasets/AI-ModelScope/hh_rlhf_cn/master/data_files)
[INFO:modelscope] Reusing cached meta-data file: /home/devops/.cache/modelscope/hub/datasets/AI-ModelScope/hh_rlhf_cn/master/data_files/042c234b69de5779cdd75934ad9c9a94
Traceback (most recent call last):
  File "/data/homedir/work/swift-play/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/homedir/work/swift-play/swift/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, **kwargs)
  File "/data/homedir/work/swift-play/swift/swift/llm/sft.py", line 91, in llm_sft
    model, callbacks = prepare_model(model, args)
  File "/data/homedir/work/swift-play/swift/swift/llm/tuner.py", line 151, in prepare_model
    model = Swift.from_pretrained(
  File "/data/homedir/work/swift-play/swift/swift/tuners/base.py", line 963, in from_pretrained
    return SwiftModel.from_pretrained(
  File "/data/homedir/work/swift-play/swift/swift/tuners/base.py", line 412, in from_pretrained
    self.load_state_dict(state_dict, adapter_name=_adapter)
  File "/data/homedir/work/swift-play/swift/swift/tuners/base.py", line 149, in load_state_dict
    incompatible_keys = self.base_model.load_state_dict(state_dict, False)
  File "/home/devops/miniconda3/envs/swift-training/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Qwen2ForCausalLM:
    While copying the parameter named "model.layers.0.self_attn.q_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 5120]) and whose dimensions in the checkpoint are torch.Size([8, 5120]), an exception occurred : ('CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n',).
    While copying the parameter named "model.layers.0.self_attn.q_proj.lora_B.default.weight", whose dimensions in the model are torch.Size([5120, 8]) and whose dimensions in the checkpoint are torch.Size([5120, 8]), an exception occurred : ('CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n',).
    While copying the parameter named "model.layers.0.self_attn.k_proj.lora_A.default.weight", whose dimensions in the model are torch.Size([8, 5120]) and whose dimensions in the checkpoint are torch.Size([8, 5120]), an exception occurred : ('CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n',).
...
# many more repeated messages of the same kind follow

Training command:

MKL_SERVICE_FORCE_INTEL=1 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft            \
--model_type qwen1half-14b-chat         \
--dataset hh-rlhf-cn          \
--train_dataset_sample -1       \
--logging_steps 10      \
--max_length 4096       \
--warmup_ratio 0.03             \
--output_dir output             \
--train_dataset_mix_ratio 2.0 \
--num_train_epochs 3 \
--lora_rank 8  \
--lora_alpha 32 \
--lora_dropout_p 0.05  \
--batch_size 1 \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 8 \
--max_grad_norm 0.5 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--deepspeed default-zero3

The resume command only adds:

--resume_from_checkpoint

Environment: Ubuntu 22.04, CUDA 12.3, PyTorch 2.1.2. swift version: installed from source, commit b039ea781834480349e23632cdfdf9df6484c506. Hardware: 8 × V100 32GB.
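
As a first step in narrowing this down, here is a minimal diagnostic sketch (not swift code; the adapter filename and path are assumptions, adjust them to whatever actually sits inside the checkpoint directory, and it assumes a .bin weight file rather than safetensors). It loads the saved LoRA weights onto the CPU and reports their size and device placement. A rank-8 adapter for a 14B model should only be tens of MiB, so if it loads cleanly on CPU the OOM during resume is more likely about where and when the weights are copied than about their size:

import torch

# Hypothetical path: point this at the adapter weight file inside your checkpoint-XXX directory.
ckpt_path = "output/qwen1half-14b-chat/vX-XXXXXXXX/checkpoint-100/adapter_model.bin"

# map_location="cpu" keeps deserialization off the GPU; without it, tensors
# saved from CUDA are restored onto the CUDA device they were saved from.
state_dict = torch.load(ckpt_path, map_location="cpu")

total_mib = sum(t.numel() * t.element_size() for t in state_dict.values()) / 1024 ** 2
devices = {str(t.device) for t in state_dict.values()}
print(f"{len(state_dict)} tensors, {total_mib:.1f} MiB, devices: {devices}")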

Jintao-Huang commented 3 months ago

Looks like a bug.

Elissa0723 commented 3 weeks ago

I ran into a similar problem when training Qwen2-57B-A14B-Instruct. The script is as follows:

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
--model_type qwen2-57b-a14b-instruct \
--model_id_or_path /publicdata/huggingface.co/Qwen/Qwen2-57B-A14B-Instruct \
--num_train_epochs 5 \
--batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 5e-5 \
--sft_type lora \
--dataset /workspace/huj11@xiaopeng.com/code/swift/data/train_all_0426_v5_swift.json \
--output_dir /dataset/huj11/ft_models/qwen2_57b_moe_0426_v5/ \
--use_flash_attn true \
--resume_from_checkpoint /dataset/huj11/ft_models/qwen2_57b_moe_0426_v5/qwen2-57b-a14b-instruct/v8-20240613-172241/checkpoint-1450

After resuming from the checkpoint it fails with: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

swift version: 2.2.0.dev0

Jintao-Huang commented 1 week ago

> I ran into a similar problem when training Qwen2-57B-A14B-Instruct. The script is as follows: [same script as in the previous comment]
>
> After resuming from the checkpoint it fails with: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
>
> swift version: 2.2.0.dev0

This one looks like it is simply running out of GPU memory.
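
If GPU memory is the suspect, a quick way to confirm it (a small sketch, assuming it is run on the training node right before launching the resume command) is to print how much memory is actually free on each visible GPU. A leftover process from the previous run, or another job sharing the cards, would explain why resuming OOMs even though the original run fit:

import torch

# Report free vs. total memory per visible GPU (torch.cuda.mem_get_info returns bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")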