huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

cannot replicate DPO results of zephyr #124

Open AlexiaJM opened 4 months ago

AlexiaJM commented 4 months ago

I cannot replicate the DPO results for zephyr.

I use a modified version of config_full.yaml; the only difference is that I set gradient_accumulation_steps: 4 instead of 2 because I train on 4 GPUs. I'm using the software versions pinned in setup.py. I resumed twice during training, which is unavoidable on our cluster, but as long as resuming restores the seeds properly this should not be a problem.
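For reference, here is a quick sanity check (a minimal sketch, not part of the handbook) that the effective batch size still matches the published recipe after my change, assuming the published run used 8 GPUs with gradient_accumulation_steps: 2 and that per_device_train_batch_size is unchanged; the per-device value below is a placeholder:

```python
# Sanity check that the effective batch size matches the published recipe
# after changing GPU count and gradient accumulation.
per_device_train_batch_size = 8   # placeholder, use the value from config_full.yaml

published = per_device_train_batch_size * 8 * 2   # assumed: 8 GPUs, grad_accum 2
modified = per_device_train_batch_size * 4 * 4    # my run: 4 GPUs, grad_accum 4

assert published == modified, "effective batch sizes differ"
print(f"effective batch size: {modified}")
```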

Code: `ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=4 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full4.yaml`

The results are here: https://huggingface.co/AlexiaJM/zephyr-7b-dpo-full-repnew. As you can see, the numbers are slightly lower than https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full, though not dramatically so.

These are the results from MT-Bench:

**First turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full | 1 | 7.81250 |
| zephyr-7b-dpo-full-repnew | 1 | 7.5375 |

**Second turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full | 2 | 7.322785 |
| zephyr-7b-dpo-full-repnew | 2 | 7.125 |

**Average**

| model | score |
| --- | --- |
| zephyr-7b-dpo-full | 7.569182 |
| zephyr-7b-dpo-full-repnew | 7.33125 |
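In case it helps anyone judge whether a gap of this size matters, here is a minimal sketch (not the handbook's evaluation code) of a paired bootstrap over the 80 MT-Bench questions; the score arrays are placeholders that you would fill with per-question averages parsed from FastChat's judgment output:

```python
# Paired bootstrap on the per-question MT-Bench score gap between two models.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: replace with real per-question averages (80 values per model).
scores_ref = rng.uniform(5, 10, size=80)   # published zephyr-7b-dpo-full
scores_rep = rng.uniform(5, 10, size=80)   # replicated checkpoint

diff = scores_rep - scores_ref
boot_means = np.array([
    rng.choice(diff, size=diff.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean gap {diff.mean():+.3f}, 95% bootstrap CI [{lo:+.3f}, {hi:+.3f}]")
# If the interval comfortably contains 0, the gap is hard to distinguish
# from judge/sampling noise on only 80 questions.
```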

AlexiaJM commented 4 months ago

Related to https://github.com/huggingface/alignment-handbook/issues/45

xijiu9 commented 4 months ago

I'm running into a similar issue. My models give:

**First turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full-self-ref | 1 | 7.79375 |
| zephyr-7b-dpo-full-self | 1 | 7.43750 |
| zephyr-7b-sft-full-self-ref | 1 | 6.63125 |
| zephyr-7b-sft-full-self | 1 | 6.39375 |

**Second turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full-self-ref | 2 | 7.35000 |
| zephyr-7b-dpo-full-self | 2 | 6.69375 |
| zephyr-7b-sft-full-self-ref | 2 | 5.97500 |
| zephyr-7b-sft-full-self | 2 | 5.61250 |

**Average**

| model | score |
| --- | --- |
| zephyr-7b-dpo-full-self-ref | 7.571875 |
| zephyr-7b-dpo-full-self | 7.065625 |
| zephyr-7b-sft-full-self-ref | 6.303125 |
| zephyr-7b-sft-full-self | 6.003125 |

xijiu9 commented 4 months ago

Models ending in '-ref' are the official checkpoints from Hugging Face, and models ending in '-self' are my reproductions of the experiment.

EriChen0615 commented 4 months ago

Experiencing similar issues here. The replicated model scores about 0.3 lower than the published zephyr-7b-dpo-full.

| # | Model | MT-Bench score | Source |
| --- | --- | --- | --- |
| 1 | Zephyr-7B-sft | 6.24 | reported in the blog post (HF tutorial) |
| 2 | Zephyr-7b-dpo-full | 7.50 | reported in the blog post (HF tutorial) |
| 3 | Zephyr-7B-sft | 6.42 | re-evaluated with FastChat's inference script, empty system message |
| 4 | Zephyr-7b-dpo-full | 7.48 | re-evaluated with FastChat's inference script, empty system message |
| 5 | Zephyr-7b-dpo-beta=0.01 | 7.16 | trained with this repo |

In addition, the training statistics when training Zephyr-7B with beta=0.01 are very different from what was published. I checked against the published DPO training statistics of Zephyr-7b-dpo-full at epoch 0.84. Below I list the values from our training run, with the reported values in parentheses.

The difference in rewards/accuracies looks alarming. Any idea what could be the cause @lewtun?
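For anyone comparing these numbers, here is a rough sketch of how the logged metrics are derived, paraphrasing what TRL's DPOTrainer does rather than quoting it; the log-prob tensors below are made-up placeholders:

```python
# Sketch of the logged DPO reward metrics: implicit rewards are the beta-scaled
# log-ratios between the policy and the frozen reference model.
import torch

beta = 0.01
# Summed log-probs of the chosen/rejected completions, one per example (placeholders).
policy_chosen_logps = torch.tensor([-310.0, -295.0, -402.0, -288.0])
policy_rejected_logps = torch.tensor([-305.0, -310.0, -398.0, -300.0])
ref_chosen_logps = torch.tensor([-312.0, -296.0, -401.0, -290.0])
ref_rejected_logps = torch.tensor([-303.0, -308.0, -400.0, -299.0])

chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

# rewards/accuracies: fraction of pairs where the chosen response wins.
accuracies = (chosen_rewards > rejected_rewards).float().mean().item()
# rewards/margins: mean gap between chosen and rejected rewards.
margins = (chosen_rewards - rejected_rewards).mean().item()
print(f"rewards/accuracies={accuracies:.3f}  rewards/margins={margins:.4f}")
```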

@AlexiaJM @xijiu9 let me know if you have any progress in replicating!

Best, Eric

gxxu-ml commented 2 months ago

A rewards/accuracies of 0.33 doesn't seem reasonable at epoch 0.84. And yet there is still a rewards/margins of 0.36, even though the model is far more likely to rank the rejected response above the chosen one than the other way around?
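A toy illustration (made-up numbers, not from any real run) of how the two metrics can diverge: a few pairs with large positive margins can outweigh many pairs with small negative ones, since accuracy only counts wins while the margin averages their size.

```python
# Toy example: accuracy below 0.5 can coexist with a positive mean margin.
import numpy as np

margins = np.array([1.5, 1.2] + [-0.12] * 4)  # 2 big wins, 4 small losses

accuracy = (margins > 0).mean()
mean_margin = margins.mean()
print(f"accuracy={accuracy:.2f}, mean margin={mean_margin:.2f}")
# -> accuracy=0.33, mean margin=0.37, roughly the 0.33 / 0.36 pattern discussed above
```

So the combination is not impossible in itself, though whether it is healthy at epoch 0.84 is a separate question.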