This PR fixes a bug where we weren't quantising the base model with QLoRA during DPO and thus were actually doing LoRA instead.
Now we first quantise the base model in 4-bit and then load the SFT adapter (which later gets merged within the `DPOTrainer`). Although this isn't as memory-efficient as loading two adapters onto a single base model (example), it does provide the flexibility to customise the QLoRA config.
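For reference, a minimal sketch of the loading order described above: quantise the base model in 4-bit via `BitsAndBytesConfig` (the standard QLoRA recipe), then attach the SFT adapter with `PeftModel` before handing the model to the trainer. The model name and adapter path are placeholders, not the ones used in this PR.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantisation config (QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Quantise the base model first (placeholder model name)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Then load the SFT adapter on top; the DPOTrainer merges it internally
model = PeftModel.from_pretrained(base_model, "path/to/sft-adapter")
```

The previous behaviour skipped the `quantization_config`, so the base model was loaded in full precision and the run was effectively plain LoRA rather than QLoRA.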
I find that with these settings MT-Bench yields a score of 7.212, which is ~0.1 lower than zephyr-7b-beta and could likely be improved with a bit more tuning of hparams.