Hello @shamanez ! In our experiments we found that the global batch size
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
can have a non-trivial effect on downstream performance (especially for fine-tuning). A related aspect is that with QLoRA one cannot shard the model weights with ZeRO-3, so we typically need to scale down the per-device batch size.
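For illustration, here is a minimal sketch of how that trade-off works in practice. The variable names follow the Hugging Face `TrainingArguments` convention used in the handbook recipes, but the concrete numbers are assumptions chosen for the example, not values from any recipe:

```python
# Minimal sketch: keeping the global batch size constant when the
# per-device batch size must shrink (e.g. for QLoRA, where the model
# weights cannot be sharded with ZeRO-3).
# All concrete numbers below are illustrative assumptions.

num_gpus = 8

# Baseline setup: larger per-device batch, no gradient accumulation.
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_bs)  # 128

# QLoRA setup: per-device batch scaled down 4x, so gradient
# accumulation is scaled up 4x to preserve the same global batch size.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_bs)  # 128, unchanged
```

The point is that as long as `per_device_train_batch_size * gradient_accumulation_steps` stays constant, the global batch size (and hence the training dynamics it influences) is preserved even when per-GPU memory constraints force a smaller per-device batch.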
Ok, now I understand. Thanks a lot.
https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/sft/config_lora.yaml