huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Apply quantization during DPO QLoRA #115

Closed · lewtun closed 5 months ago

lewtun commented 5 months ago

This PR fixes a bug where we weren't quantising the base model with QLoRA during DPO and thus were actually doing LoRA instead.

Now we first quantise the base model in 4-bit and load the SFT adapter (which later gets merged within the DPOTrainer). Although this isn't as memory-efficient as loading two adapters in a single base model (example), it does provide the flexibility to customise the QLoRA config.
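For context, here's a minimal sketch of the fixed loading path, assuming Mistral-7B as the base model and a placeholder path for the SFT adapter (both are illustrative; the handbook's actual recipe configs may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantise the base model in 4-bit. This step was missing before the fix,
# so training silently fell back to plain LoRA on full-precision weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Load the SFT adapter on top of the quantised base model; the DPOTrainer
# later merges it before attaching the fresh DPO adapter.
# "path/to/sft-adapter" is a placeholder, not a real repo.
model = PeftModel.from_pretrained(base_model, "path/to/sft-adapter")
```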

I find that with these settings, MT-Bench yields a score of 7.212, which is ~0.1 lower than zephyr-7b-beta and could likely be improved with a bit more hyperparameter tuning.

HuggingFaceDocBuilderDev commented 5 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.