hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Slow batched evals #4801

Open shreyaspimpalgaonkar opened 1 month ago

shreyaspimpalgaonkar commented 1 month ago


System Info

- llamafactory version: 0.8.3.dev0
- Python version: 3.11.9
- Platform: AWS EC2 instance

Reproduction

# /home/ft/ holds the phi3-3.8b model
llamafactory-cli train \
      --stage sft \
      --model_name_or_path /home/ft/ \
      --preprocessing_num_workers 16 \
      --finetuning_type full \
      --template phi \
      --flash_attn fa2 \
      --dataset_dir data \
      --dataset triples_new_ds \
      --cutoff_len 4096 \
      --max_samples 500 \
      --per_device_eval_batch_size 8 \
      --predict_with_generate True \
      --max_new_tokens 1024 \
      --top_p 1 \
      --temperature 1 \
      --output_dir <out_dir> \
      --do_predict True \
      --quantization_method bitsandbytes \
      --seed $seed

Expected behavior

Hi,

The above script runs a batched eval over 500 examples on an A100 node with batch size 8 and takes about an hour. That is significantly slower than running the same eval with batch size 1, which finishes in around 15 minutes. Do you know why this might be happening (perhaps the longest generation in each batch is the bottleneck)? And is there a way to make batched evals faster? The model is small, so I want enough parallelism to use the full GPU. Thanks so much!
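If the longest generation per batch really is the bottleneck, that would match how Hugging Face `model.generate` handles batches: decoding continues until every sequence in the batch has finished, so rows that hit EOS early sit idle as padding while the slowest row keeps generating. Below is a minimal, illustrative benchmark of that effect, independent of LLaMA-Factory; the model name, prompts, and timing harness are assumptions for the sketch, not the setup from this issue.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any small instruct model illustrates the effect; swap in your own checkpoint.
model_name = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left padding is needed for batched decoding

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Seven prompts that should finish quickly, plus one that invites a long answer.
prompts = ["Answer in one word: what is 2 + 2?"] * 7 + [
    "Write a detailed, multi-paragraph essay on the history of aviation."
]

def timed_generate(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
    torch.cuda.synchronize()
    return time.time() - start

# One batch of 8: the single long answer keeps all eight rows decoding.
t_batch = timed_generate(prompts)

# Eight batches of 1: the seven short answers stop at EOS almost immediately.
t_single = sum(timed_generate([p]) for p in prompts)

print(f"batch of 8: {t_batch:.1f}s  vs  eight batches of 1: {t_single:.1f}s")
```

If the gap is large, generic mitigations (not specific LLaMA-Factory features) include lowering max_new_tokens, grouping eval examples so long answers land in the same batch, or running generation-heavy evals through a backend with continuous batching such as vLLM.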

Others

Thanks again for the great work!

codemayq commented 1 month ago

Try setting per_device_eval_batch_size to 4 or 2 and see how the speed changes.
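A minimal sketch of that comparison, reusing the key arguments from the reproduction command above (the output directory names are hypothetical, and some flags are omitted for brevity):

```python
import subprocess
import time

# Core arguments copied from the reproduction command; the flash-attn, sampling,
# quantization, and seed flags are left out here to keep the sketch short.
base_args = [
    "llamafactory-cli", "train",
    "--stage", "sft",
    "--model_name_or_path", "/home/ft/",
    "--finetuning_type", "full",
    "--template", "phi",
    "--dataset_dir", "data",
    "--dataset", "triples_new_ds",
    "--cutoff_len", "4096",
    "--max_samples", "500",
    "--predict_with_generate", "True",
    "--max_new_tokens", "1024",
    "--do_predict", "True",
]

for bs in (1, 2, 4, 8):
    start = time.time()
    subprocess.run(
        base_args
        + ["--per_device_eval_batch_size", str(bs)]
        + ["--output_dir", f"preds_bs{bs}"],  # hypothetical output dirs
        check=True,
    )
    print(f"per_device_eval_batch_size={bs}: {time.time() - start:.0f}s")
```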

Rocky77JHxu commented 1 month ago

I have the same question; I hope someone can answer it.