huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

fix: Zephyr LoRA fine-tuning fixed #139

Closed Serega6678 closed 3 months ago

Serega6678 commented 3 months ago

I tried using LoRA fine-tuning instead of QLoRA fine-tuning and it didn't work: using exactly your training config, the LoRA training loss would collapse to 0 when bf16 was not specified in the config.

[image: training loss curve collapsing to 0 without bf16]

With bf16 the issue is resolved (I trained the LoRA model on 50% of the SFT data and got the expected results). Furthermore, I reused this bf16 config for QLoRA and the results match what you report (~0.95 SFT loss).

I also added the Flash Attention 2 flag, since it speeds up training and allows doubling the per-GPU batch size for QLoRA (4 -> 8) without changing the results at all (just to be safe, I tested this too and the curves are identical).

[image: loss curves with and without Flash Attention 2]

P.S. In my screenshots, "1%" means I trained for 1% of the SFT steps, just to verify that the losses are identical and that changing the flag doesn't break anything.

In total, this PR fixes LoRA fine-tuning, which previously failed (loss = 0), and speeds up both the QLoRA and LoRA configurations via the Flash Attention flag.
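The changes described above can be sketched as a config excerpt. This is a minimal illustration, not the repo's exact schema: the key names (`bf16`, `use_flash_attention_2`, `per_device_train_batch_size`) are assumed based on the PR description, and the batch size shown reflects the QLoRA doubling mentioned above.

```yaml
# Illustrative SFT config excerpt (key names assumed, not verified against the repo):
bf16: true                      # train in bfloat16; without this, LoRA loss collapsed to 0
use_flash_attention_2: true     # enable Flash Attention 2 to speed up training
per_device_train_batch_size: 8  # doubled from 4, feasible for QLoRA with Flash Attention 2
```

With `bf16: true` the LoRA and QLoRA runs reportedly reproduce the expected ~0.95 SFT loss, and the Flash Attention flag changes throughput only, not the loss curves.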

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.