Does the learning rate need to be linearly scaled depending on the number of GPUs and gradient_accumulation_steps? (Maybe per_device_train_batch_size isn't so critical, right?)
Reminder
Reproduction
It's an awesome project! Thank you for your wonderful contributions!
Here is an example of SFT using DeepSpeed:
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "xxx" \
    --do_train \
    --dataset alpaca_en \
    --dataset_dir ./data \
    --finetuning_type lora \
    --output_dir "xxx" \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --fp16 \
    --template default \
    --deepspeed "scripts/ds_z3_config_lora.json"
Here are some questions about multi-GPU fine-tuning:
Does the learning rate need to be linearly scaled depending on the number of GPUs and per_device_train_batch_size? e.g., currently gpus=8, per_device_train_batch_size=16, lr=5e-5. So with gpus=4 and per_device_train_batch_size=4, lr ≈ 6.25e-6, right?
Based on your rich experience with general NLP tasks (e.g., ARC-c/ARC-e/BoolQ/HellaSwag/MMLU/OBQA/RTE/WinoGrande, and so on), how much loss reduction is considered good (e.g., lower than 1 for alpaca_en)?
If the training loss decreases, does that mean the model will perform well on general NLP tasks?
For base models (like Mixtral-8x7B, not Mixtral-8x7B-Instruct), does using a different template (default/alpaca/vicuna) affect their zero-shot performance on general NLP tasks?
I know you are very busy, but I am still looking forward to your reply. Thanks!
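For reference, here is a minimal sketch of the linear-scaling arithmetic behind the first question, assuming the effective batch size is num_gpus × per_device_train_batch_size × gradient_accumulation_steps and the learning rate scales proportionally with it (the helper names are hypothetical, not part of the project):

```python
# Hypothetical sketch of the linear LR-scaling rule, not project code.
# Assumption: effective batch = num_gpus * per_device_batch * grad_accum,
# and the learning rate is scaled in the same ratio as the effective batch.

def effective_batch(num_gpus: int, per_device: int, grad_accum: int) -> int:
    """Total samples contributing to one optimizer step."""
    return num_gpus * per_device * grad_accum

def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linearly rescale the learning rate with the effective batch size."""
    return base_lr * new_batch / base_batch

# Original run: 8 GPUs x 16 per-device x 4 accumulation steps = 512
base = effective_batch(8, 16, 4)
# Proposed run: 4 GPUs x 4 per-device x 4 accumulation steps = 64
new = effective_batch(4, 4, 4)

# 512 -> 64 is an 8x reduction, so 5e-5 / 8 = 6.25e-6
print(scaled_lr(5e-5, base, new))
```

This matches the 6.25e-6 figure in the question, but only under the linear-scaling assumption; in practice people also keep gradient_accumulation_steps in the calculation, as the follow-up question suggests.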
Expected behavior
No response
System Info
No response
Others
No response