Hello @shamanez ! In our experiments we found that the global batch size
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
can have a non-trivial effect on downstream performance (especially for fine-tuning). A related aspect is that with QLoRA one cannot shard the model weights with ZeRO-3, so we typically need to scale down the per-device batch size.
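For illustration, here is a minimal sketch of how that trade-off works in practice. The variable names follow the Hugging Face `TrainingArguments` convention used in the handbook recipes, but the concrete numbers are assumptions chosen for the example, not values from any recipe:

```python
# Minimal sketch: keeping the global batch size constant when the
# per-device batch size must shrink (e.g. for QLoRA, where the model
# weights cannot be sharded with ZeRO-3).
# All concrete numbers below are illustrative assumptions.

num_gpus = 8

# Baseline setup: larger per-device batch, no gradient accumulation.
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_bs)  # 128

# QLoRA setup: per-device batch scaled down 4x, so gradient
# accumulation is scaled up 4x to preserve the same global batch size.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
global_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_bs)  # 128, unchanged
```

The point is that as long as `per_device_train_batch_size * gradient_accumulation_steps` stays constant, the global batch size (and hence the training dynamics it influences) is preserved even when per-GPU memory constraints force a smaller per-device batch.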
Ok, now I understand. Thanks a lot.
https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-beta/sft/config_lora.yaml