This PR fixes a bug where we weren't quantising the base model with QLoRA during DPO and thus were actually doing LoRA instead.
Now we first quantise the base model in 4-bit and then load the SFT adapter (which later gets merged within the `DPOTrainer`). Although this isn't as memory-efficient as loading two adapters onto a single base model (example), it does provide the flexibility to customise the QLoRA config.
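For reference, a minimal sketch of the loading order described above: quantise the base model in 4-bit via `BitsAndBytesConfig` (the standard QLoRA recipe), then attach the SFT adapter with `PeftModel` before handing the model to the trainer. The model name and adapter path are placeholders, not the ones used in this PR.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantisation config (QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Quantise the base model first (placeholder model name)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Then load the SFT adapter on top; the DPOTrainer merges it internally
model = PeftModel.from_pretrained(base_model, "path/to/sft-adapter")
```

The previous behaviour skipped the `quantization_config`, so the base model was loaded in full precision and the run was effectively plain LoRA rather than QLoRA.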
I find that with these settings MT-Bench yields a score of 7.212, which is ~0.1 lower than zephyr-7b-beta and could likely be improved with a bit more tuning of hparams.