hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

DoRA trains normally, but the checkpoint it produces uses far too much GPU memory when loaded. #5144

Closed: frozenarctic closed this issue 1 month ago

frozenarctic commented 2 months ago

Reminder

System Info

bin D:\LLaMA-Factory\venv\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll

Reproduction

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path models/Qwen2-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset id_zh,qa_zh \
    --cutoff_len 2048 \
    --learning_rate 1e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 20 \
    --save_steps 400 \
    --warmup_steps 0 \
    --neftune_noise_alpha 0.1 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves\Qwen2-7B-Chat\lora\checkpoint \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 64 \
    --lora_alpha 128 \
    --lora_dropout 0.05 \
    --create_new_adapter True \
    --use_rslora True \
    --use_dora True \
    --lora_target all
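
For reference, the adapter flags above should map roughly to a PEFT LoraConfig like the minimal sketch below. This is only an illustration of the adapter settings, not the exact object LLaMA-Factory builds, and treating --lora_target all as PEFT's "all-linear" shortcut is an assumption:

from peft import LoraConfig

# Rough PEFT equivalent of the adapter flags in the command above (assumed mapping).
lora_config = LoraConfig(
    r=64,                         # --lora_rank 64
    lora_alpha=128,               # --lora_alpha 128
    lora_dropout=0.05,            # --lora_dropout 0.05
    use_rslora=True,              # --use_rslora True
    use_dora=True,                # --use_dora True: adds a trainable magnitude vector per target layer
    target_modules="all-linear",  # assumed equivalent of --lora_target all
    task_type="CAUSAL_LM",
)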

Expected behavior

Training Qwen2-7B-Instruct without the DoRA option and then loading the checkpoint for chat uses about 18 GB of GPU memory, as shown below: [screenshot: no dora]

Training with the DoRA option enabled and then loading the checkpoint for chat, the GPU memory usage is abnormally high, about 39 GB, as shown below: [screenshot: dora ckpt]

Inference even uses more GPU memory than training: DoRA training of Qwen2-7B-Instruct takes about 34 GB, as shown below: [screenshot: dora train]
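
To measure the same thing outside llamafactory-cli chat, a minimal sketch like the one below loads the base model plus the saved adapter with PEFT and prints the peak GPU memory after one generation. Paths are taken from the command above and may need adjusting:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "models/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)
# Attach the adapter checkpoint saved by the training run above.
model = PeftModel.from_pretrained(base, "saves/Qwen2-7B-Chat/lora/checkpoint")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("models/Qwen2-7B-Instruct")
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)

print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")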

Others

No response

SinlinLi commented 3 weeks ago

I'm hitting the same problem: GPU memory is sufficient during training, but inference OOMs instead.
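
One thing that might be worth trying (untested here) is merging the adapter into the base weights before inference, so that no DoRA modules remain at generation time. A minimal sketch with PEFT, assuming a recent peft release that can merge DoRA adapters and a non-quantized checkpoint; the paths follow the training command above:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "models/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "saves/Qwen2-7B-Chat/lora/checkpoint")
# Fold the adapter (including the DoRA magnitude vectors) back into the base weights.
merged = model.merge_and_unload()
merged.save_pretrained("saves/Qwen2-7B-Chat/merged")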