hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Could you please share some tips with your rich experience? #3452

Closed xiaochengsky closed 4 months ago

xiaochengsky commented 5 months ago

Reminder

Reproduction

It's an awesome project! Thank you for your wonderful contributions!

Here is an example SFT run using DeepSpeed:

deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "xxx" \
    --do_train \
    --dataset alpaca_en \
    --dataset_dir ./data \
    --finetuning_type lora \
    --output_dir "xxx" \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --fp16 \
    --template default \
    --deepspeed "scripts/ds_z3_config_lora.json"
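
For reference, below is a minimal sketch of what a ZeRO-3 DeepSpeed config such as the referenced scripts/ds_z3_config_lora.json typically contains. The keys are standard DeepSpeed options, but the exact contents are illustrative assumptions rather than a copy of the repo's file; the "auto" placeholders are resolved by the HF Trainer integration.

import json

# Illustrative ZeRO-3 config (assumed contents, not the repo's actual ds_z3_config_lora.json).
ds_z3_config = {
    "train_batch_size": "auto",                # resolved from the training arguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,                            # ZeRO stage 3: shard parameters, gradients, optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config_lora.json", "w") as f:
    json.dump(ds_z3_config, f, indent=2)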

Here are some questions about multi-GPU fine-tuning:

  1. Does the learning rate need to be linearly scaled according to the number of GPUs and per_device_train_batch_size? E.g., currently gpus=8, per_device_train_batch_size=16, lr=5e-5. So if gpus=4 and per_device_train_batch_size=4, lr ≈ 6.25e-6, right? (See the arithmetic sketch after this list.)

  2. Based on your rich experience, for general NLP tasks (e.g., ARC-c/ARC-e/BoolQ/HellaSwag/MMLU/OBQA/RTE/WinoGrande, and so on), how much loss reduction is considered good (e.g., lower than 1 for alpaca_en)?

  3. If the training loss decreases, does that mean the model will perform well on general NLP tasks?

  4. For base models (like Mixtral-8x7B, not Mixtral-8x7B-Instruct), does using a different template (default/alpaca/vicuna) affect their zero-shot performance on general NLP tasks?
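
For question 1, here is a minimal sketch of the linear-scaling arithmetic, using the numbers from the command above; the linear rule itself is a common heuristic (keep lr / effective_batch_size constant), not something enforced by the framework.

# Effective batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps.
base_gpus, base_bs, grad_accum, base_lr = 8, 16, 4, 5e-5
new_gpus, new_bs = 4, 4

base_effective_bs = base_gpus * base_bs * grad_accum   # 8 * 16 * 4 = 512
new_effective_bs = new_gpus * new_bs * grad_accum      # 4 * 4 * 4 = 64

# Linear scaling: shrink the LR by the same factor as the effective batch size.
new_lr = base_lr * new_effective_bs / base_effective_bs
print(new_effective_bs, new_lr)                        # 64 6.25e-06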

I know you are very busy, but I am still looking forward to your reply. Thanks!

Expected behavior

No response

System Info

No response

Others

No response

xiaochengsky commented 5 months ago

Maybe I should update the first question.

  1. Does the learning rate need to be linearly scaled according to the number of GPUs and gradient_accumulation_steps (maybe per_device_train_batch_size isn't so critical, right)?

hiyouga commented 4 months ago

You can use https://github.com/hiyouga/LLaMA-Factory/blob/main/scripts/cal_lr.py
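
For context, one common alternative to plain linear scaling is to derive the learning rate from the effective batch size measured in tokens, using a square-root rule. The sketch below is conceptual only; it is not a reimplementation of cal_lr.py, and the base constants are illustrative assumptions (see the script itself for its actual arguments and values).

import math

# Square-root LR scaling by token-level batch size (illustrative constants).
BASE_LR = 3e-4        # assumed reference learning rate
BASE_BS = 4_000_000   # assumed reference batch size in tokens

def suggest_lr(num_gpus: int, per_device_bs: int, grad_accum: int, avg_tokens_per_sample: float) -> float:
    """Scale the reference LR by the square root of the token batch size ratio."""
    token_batch = num_gpus * per_device_bs * grad_accum * avg_tokens_per_sample
    return BASE_LR * math.sqrt(token_batch / BASE_BS)

# Example: 8 GPUs x 16 samples x 4 accumulation steps, ~400 tokens per sample on average.
print(f"{suggest_lr(8, 16, 4, 400):.2e}")  # roughly 6.79e-05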