huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Self trained zephyr-7b-dpo-qlora MT-bench score dropped to 1.88 #188

Open jltchiu opened 3 months ago

jltchiu commented 3 months ago

Hi, I just followed recipes/zephyr-7b-beta/dpo/config_qlora.yaml, hoping to replicate the experiments. I was training on a single A10G GPU, and the only modification I made was reducing the train batch size from 4 to 1 (due to memory constraints). However, my output model zephyr-7b-dpo-qlora only reaches an MT-Bench score of 1.88. I also ran MT-Bench on the downloaded zephyr-7b-sft-qlora, and it scored 6.37 (which seems relatively normal). Has anyone else had difficulty replicating this DPO experiment with QLoRA? Or is the batch size a critical difference for training?
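
One thing I wonder is whether I should have raised gradient accumulation to keep the effective batch size at 4 when I dropped the per-device batch size to 1. A sketch of what I mean (assuming the config's `per_device_train_batch_size` / `gradient_accumulation_steps` keys and that the handbook's argument parser accepts command-line overrides; values here are illustrative):

```shell
# Single-GPU DPO QLoRA run, compensating the smaller per-device batch
# with gradient accumulation: 1 (per device) x 4 (accumulation) x 1 (GPU)
# keeps the per-GPU effective batch at 4. Note the published recipe was
# likely run on multiple GPUs, so the *global* batch size may still differ.
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --num_processes=1 \
  scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_qlora.yaml \
  --per_device_train_batch_size=1 \
  --gradient_accumulation_steps=4
```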

jltchiu commented 3 months ago

Update: I used the MT-Bench master branch to run the benchmark on 3 models, with GPT-4 as the judge:

| Model | MT-Bench score |
| --- | --- |
| zephyr-7b-sft-qlora (downloaded) | 6.365625 |
| zephyr-7b-dpo-qlora (downloaded) | 4.443038 |
| zephyr-7b-dpo-qlora (trained) | 1.883648 |
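
For reference, this is roughly the workflow I followed with FastChat's llm_judge scripts (model paths and IDs below are placeholders):

```shell
# MT-Bench lives in FastChat's llm_judge directory.
cd FastChat/fastchat/llm_judge

# 1. Generate answers to the 80 MT-Bench questions.
python gen_model_answer.py \
  --model-path <path-or-hub-id-of-model> \
  --model-id zephyr-7b-dpo-qlora

# 2. Score the answers with GPT-4 (requires OPENAI_API_KEY to be set).
python gen_judgment.py \
  --model-list zephyr-7b-dpo-qlora \
  --judge-model gpt-4

# 3. Print the aggregated MT-Bench score.
python show_result.py --model-list zephyr-7b-dpo-qlora
```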

Even the downloaded QLoRA DPO model scores worse than the SFT model. Has anyone else observed this?