hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Pre-training efficiency issue #3204

Closed: 18140663659 closed this issue 2 months ago

18140663659 commented 2 months ago

Reminder

Reproduction

accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path $model_name_or_path \
    --do_train \
    --dataset $dataset \
    --streaming \
    --max_steps 10000 \
    --finetuning_type full \
    --output_dir $output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 5e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --use_fast_tokenizer false \
    --preprocessing_num_workers 64 \
    --cutoff_len 2048 \
    --bf16 \
    --warmup_steps 10 \
    --max_grad_norm 1.0 2>&1 | tee $output_dir/log.txt

Expected behavior

Running the pre-training command above on 7 GPUs, 10000 steps take roughly 2 days. Is there any way to speed this up?
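
For scale, a rough back-of-the-envelope estimate derived from the flags above, assuming all 7 GPUs run data-parallel replicas and the pre-training stage packs every sample to cutoff_len:

7 GPUs x 2 (per_device_train_batch_size) x 8 (gradient_accumulation_steps) = 112 sequences per optimizer step
112 sequences x 2048 tokens (cutoff_len) ≈ 229k tokens per step
10000 steps ≈ 2.3B tokens, so ~2 days corresponds to roughly 13k tokens/s across the node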

System Info

torch==1.14.0a0+410ce96
uvicorn
fastapi==0.95.1
sse-starlette
tiktoken
trl==0.7.4
peft>=0.4.0
accelerate>=0.21.0
jieba
rouge-chinese
gradio
fsspec==2023.9.2
transformers==4.31.0
deepspeed==0.9.1
deepspeed==0.9.3
nltk
openpyxl

Others

codemayq commented 2 months ago

You did not say which model you are training.

  1. Raise per_device_train_batch_size as far as it will go without running out of GPU memory.
  2. Enabling flash attention gives a modest additional speedup (see the sketch below this list).
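
A minimal sketch of how those two suggestions could look in the original command, assuming your LLaMA-Factory checkout still uses src/train_bash.py and accepts --flash_attn as a boolean flag (newer releases take --flash_attn fa2 instead); the batch size of 4 is only an illustration, tune it to whatever fits in memory:

# per_device_train_batch_size raised 2 -> 4 and gradient_accumulation_steps lowered 8 -> 4,
# so the effective batch (7 GPUs x 4 x 4 = 112 sequences) matches the original run;
# --flash_attn is assumed to be a boolean flag in this version of the launcher.
accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path $model_name_or_path \
    --do_train \
    --dataset $dataset \
    --streaming \
    --max_steps 10000 \
    --finetuning_type full \
    --output_dir $output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --flash_attn \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 500 \
    --save_total_limit 2 \
    --learning_rate 5e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --use_fast_tokenizer false \
    --preprocessing_num_workers 64 \
    --cutoff_len 2048 \
    --bf16 \
    --warmup_steps 10 \
    --max_grad_norm 1.0 2>&1 | tee $output_dir/log.txt

Keeping the effective batch size unchanged means the learning rate and warmup settings do not need to be retuned; the speedup comes from fewer, larger forward/backward passes and from the flash attention kernels.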