huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

cannot replicate DPO results of zephyr #124

Open AlexiaJM opened 4 months ago

AlexiaJM commented 4 months ago

I cannot replicate the DPO results for zephyr.

I use a modified version of config_full.yaml; the only difference is that I set gradient_accumulation_steps: 4 instead of 2 because I train on 4 GPUs. I'm using the software versions pinned in setup.py. I resumed twice during training, which is unavoidable on our cluster, but as long as resuming restores the seeds properly this should not be a problem.
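For reference, here is a quick sanity check (a minimal sketch, not part of the handbook) that the effective batch size still matches the published recipe after my change, assuming the published run used 8 GPUs with gradient_accumulation_steps: 2 and that per_device_train_batch_size is unchanged; the per-device value below is a placeholder:

```python
# Sanity check that the effective batch size matches the published recipe
# after changing GPU count and gradient accumulation.
per_device_train_batch_size = 8   # placeholder, use the value from config_full.yaml

published = per_device_train_batch_size * 8 * 2   # assumed: 8 GPUs, grad_accum 2
modified = per_device_train_batch_size * 4 * 4    # my run: 4 GPUs, grad_accum 4

assert published == modified, "effective batch sizes differ"
print(f"effective batch size: {modified}")
```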

Code: `ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=4 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full4.yaml`

The results are here: https://huggingface.co/AlexiaJM/zephyr-7b-dpo-full-repnew. As you can see, the numbers are slightly lower than https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full, though not dramatically so.

These are the results from MT-Bench:

**First turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full | 1 | 7.81250 |
| zephyr-7b-dpo-full-repnew | 1 | 7.5375 |

**Second turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full | 2 | 7.322785 |
| zephyr-7b-dpo-full-repnew | 2 | 7.125 |

**Average**

| model | score |
| --- | --- |
| zephyr-7b-dpo-full | 7.569182 |
| zephyr-7b-dpo-full-repnew | 7.33125 |
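In case it helps anyone judge whether a gap of this size matters, here is a minimal sketch (not the handbook's evaluation code) of a paired bootstrap over the 80 MT-Bench questions; the score arrays are placeholders that you would fill with per-question averages parsed from FastChat's judgment output:

```python
# Paired bootstrap on the per-question MT-Bench score gap between two models.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: replace with real per-question averages (80 values per model).
scores_ref = rng.uniform(5, 10, size=80)   # published zephyr-7b-dpo-full
scores_rep = rng.uniform(5, 10, size=80)   # replicated checkpoint

diff = scores_rep - scores_ref
boot_means = np.array([
    rng.choice(diff, size=diff.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean gap {diff.mean():+.3f}, 95% bootstrap CI [{lo:+.3f}, {hi:+.3f}]")
# If the interval comfortably contains 0, the gap is hard to distinguish
# from judge/sampling noise on only 80 questions.
```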

AlexiaJM commented 4 months ago

Related to https://github.com/huggingface/alignment-handbook/issues/45

xijiu9 commented 4 months ago

I'm running into a similar issue. My models give:

**First turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full-self-ref | 1 | 7.79375 |
| zephyr-7b-dpo-full-self | 1 | 7.43750 |
| zephyr-7b-sft-full-self-ref | 1 | 6.63125 |
| zephyr-7b-sft-full-self | 1 | 6.39375 |

**Second turn**

| model | turn | score |
| --- | --- | --- |
| zephyr-7b-dpo-full-self-ref | 2 | 7.35000 |
| zephyr-7b-dpo-full-self | 2 | 6.69375 |
| zephyr-7b-sft-full-self-ref | 2 | 5.97500 |
| zephyr-7b-sft-full-self | 2 | 5.61250 |

**Average**

| model | score |
| --- | --- |
| zephyr-7b-dpo-full-self-ref | 7.571875 |
| zephyr-7b-dpo-full-self | 7.065625 |
| zephyr-7b-sft-full-self-ref | 6.303125 |
| zephyr-7b-sft-full-self | 6.003125 |

xijiu9 commented 4 months ago

Models ending in '-ref' are the official checkpoints from Hugging Face, and models ending in '-self' are my reproductions of the experiment.

EriChen0615 commented 4 months ago

Experiencing similar issues here. The replicated model scores about 0.3 lower than the published zephyr-7b-dpo-full.

| # | Model | MT-Bench score | Source |
| --- | --- | --- | --- |
| 1 | Zephyr-7B-sft | 6.24 | reported in the blog post (HF tutorial) |
| 2 | Zephyr-7b-dpo-full | 7.50 | reported in the blog post (HF tutorial) |
| 3 | Zephyr-7B-sft | 6.42 | re-evaluated with FastChat's inference script, empty system message |
| 4 | Zephyr-7b-dpo-full | 7.48 | re-evaluated with FastChat's inference script, empty system message |
| 5 | Zephyr-7b-dpo-beta=0.01 | 7.16 | trained with this repo |

In addition, the training statistics when training Zephyr-7B with beta=0.01 are very different from what was published. I checked against the published DPO training statistics of Zephyr-7b-dpo-full at epoch 0.84. Below I list the values from our training run, with the reported values in parentheses.

The difference in rewards/accuracies looks alarming. Any idea what could be the cause @lewtun?
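For anyone comparing these numbers, here is a rough sketch of how the logged metrics are derived, paraphrasing what TRL's DPOTrainer does rather than quoting it; the log-prob tensors below are made-up placeholders:

```python
# Sketch of the logged DPO reward metrics: implicit rewards are the beta-scaled
# log-ratios between the policy and the frozen reference model.
import torch

beta = 0.01
# Summed log-probs of the chosen/rejected completions, one per example (placeholders).
policy_chosen_logps = torch.tensor([-310.0, -295.0, -402.0, -288.0])
policy_rejected_logps = torch.tensor([-305.0, -310.0, -398.0, -300.0])
ref_chosen_logps = torch.tensor([-312.0, -296.0, -401.0, -290.0])
ref_rejected_logps = torch.tensor([-303.0, -308.0, -400.0, -299.0])

chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

# rewards/accuracies: fraction of pairs where the chosen response wins.
accuracies = (chosen_rewards > rejected_rewards).float().mean().item()
# rewards/margins: mean gap between chosen and rejected rewards.
margins = (chosen_rewards - rejected_rewards).mean().item()
print(f"rewards/accuracies={accuracies:.3f}  rewards/margins={margins:.4f}")
```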

@AlexiaJM @xijiu9 let me know if you have any progress in replicating!

Best, Eric

gxxu-ml commented 2 months ago

A rewards/accuracies of 0.33 doesn't seem reasonable at epoch 0.84. And yet there is still a rewards/margins of 0.36, even though the model is far more likely to rank the rejected response above the chosen one than the other way around?
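A toy illustration (made-up numbers, not from any real run) of how the two metrics can diverge: a few pairs with large positive margins can outweigh many pairs with small negative ones, since accuracy only counts wins while the margin averages their size.

```python
# Toy example: accuracy below 0.5 can coexist with a positive mean margin.
import numpy as np

margins = np.array([1.5, 1.2] + [-0.12] * 4)  # 2 big wins, 4 small losses

accuracy = (margins > 0).mean()
mean_margin = margins.mean()
print(f"accuracy={accuracy:.2f}, mean margin={mean_margin:.2f}")
# -> accuracy=0.33, mean margin=0.37, roughly the 0.33 / 0.36 pattern discussed above
```

So the combination is not impossible in itself, though whether it is healthy at epoch 0.84 is a separate question.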