hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Pre-training efficiency issue #3204

Closed: 18140663659 closed this issue 2 months ago

18140663659 commented 2 months ago

Reminder

Reproduction

accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path $model_name_or_path \
    --do_train \
    --dataset $dataset \
    --streaming \
    --max_steps 10000 \
    --finetuning_type full \
    --output_dir $output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 5e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --use_fast_tokenizer false \
    --preprocessing_num_workers 64 \
    --cutoff_len 2048 \
    --bf16 \
    --warmup_steps 10 \
    --max_grad_norm 1.0 2>&1 | tee $output_dir/log.txt

Expected behavior

Running the pre-training command above on 7 GPUs, 10000 steps take roughly 2 days. Is there any way to speed this up?
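
For scale, a rough back-of-the-envelope estimate derived from the flags above, assuming all 7 GPUs run data-parallel replicas and the pre-training stage packs every sample to cutoff_len:

7 GPUs x 2 (per_device_train_batch_size) x 8 (gradient_accumulation_steps) = 112 sequences per optimizer step
112 sequences x 2048 tokens (cutoff_len) ≈ 229k tokens per step
10000 steps ≈ 2.3B tokens, so ~2 days corresponds to roughly 13k tokens/s across the node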

System Info

torch==1.14.0a0+410ce96
uvicorn
fastapi==0.95.1
sse-starlette
tiktoken
trl==0.7.4
peft>=0.4.0
accelerate>=0.21.0
jieba
rouge-chinese
gradio
fsspec==2023.9.2
transformers==4.31.0
deepspeed==0.9.1
deepspeed==0.9.3
nltk
openpyxl

Others

codemayq commented 2 months ago

You did not say which model you are training.

  1. Raise per_device_train_batch_size as far as it will go without running out of GPU memory.
  2. Enabling flash attention gives a modest additional speedup (see the sketch below this list).
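
A minimal sketch of how those two suggestions could look in the original command, assuming your LLaMA-Factory checkout still uses src/train_bash.py and accepts --flash_attn as a boolean flag (newer releases take --flash_attn fa2 instead); the batch size of 4 is only an illustration, tune it to whatever fits in memory:

# per_device_train_batch_size raised 2 -> 4 and gradient_accumulation_steps lowered 8 -> 4,
# so the effective batch (7 GPUs x 4 x 4 = 112 sequences) matches the original run;
# --flash_attn is assumed to be a boolean flag in this version of the launcher.
accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path $model_name_or_path \
    --do_train \
    --dataset $dataset \
    --streaming \
    --max_steps 10000 \
    --finetuning_type full \
    --output_dir $output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --flash_attn \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 5e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --use_fast_tokenizer false \
    --preprocessing_num_workers 64 \
    --cutoff_len 2048 \
    --bf16 \
    --warmup_steps 10 \
    --max_grad_norm 1.0 2>&1 | tee $output_dir/log.txt

Keeping the effective batch size unchanged means the learning rate and warmup settings do not need to be retuned; the speedup comes from fewer, larger forward/backward passes and from the flash attention kernels.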