AlexiaJM opened this issue 4 months ago
I ran into a similar issue: my model gives
```
########## First turn ##########
                                 score
model                       turn
zephyr-7b-dpo-full-self-ref 1    7.79375
zephyr-7b-dpo-full-self     1    7.43750
zephyr-7b-sft-full-self-ref 1    6.63125
zephyr-7b-sft-full-self     1    6.39375

########## Second turn ##########
                                 score
model                       turn
zephyr-7b-dpo-full-self-ref 2    7.35000
zephyr-7b-dpo-full-self     2    6.69375
zephyr-7b-sft-full-self-ref 2    5.97500
zephyr-7b-sft-full-self     2    5.61250

########## Average ##########
                                score
model
zephyr-7b-dpo-full-self-ref     7.571875
zephyr-7b-dpo-full-self         7.065625
zephyr-7b-sft-full-self-ref     6.303125
zephyr-7b-sft-full-self         6.003125
```
Models whose names end with '-ref' are the official checkpoints from Hugging Face; models ending with '-self' are the ones I trained when reproducing the experiment.
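As a sanity check on the tables above: MT-Bench's "Average" section is just the arithmetic mean of the two per-turn scores (when no judgments are missing), and the numbers here satisfy that exactly. A quick check, with the scores copied from the tables:

```python
# Per-turn MT-Bench scores copied from the tables above.
turn1 = {"zephyr-7b-dpo-full-self-ref": 7.79375, "zephyr-7b-dpo-full-self": 7.43750,
         "zephyr-7b-sft-full-self-ref": 6.63125, "zephyr-7b-sft-full-self": 6.39375}
turn2 = {"zephyr-7b-dpo-full-self-ref": 7.35000, "zephyr-7b-dpo-full-self": 6.69375,
         "zephyr-7b-sft-full-self-ref": 5.97500, "zephyr-7b-sft-full-self": 5.61250}

# "Average" = mean of the two turns, per model.
average = {model: (turn1[model] + turn2[model]) / 2 for model in turn1}
```

For example, `average["zephyr-7b-dpo-full-self-ref"]` comes out to 7.571875, matching the Average table.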
Experiencing similar issues here. The replicated model scores about 0.3 lower than the published zephyr-7b-dpo-full.
Reported in the blog post:
1. Zephyr-7B-sft: 6.24 (from the HF tutorial)
2. Zephyr-7b-dpo-full: 7.50 (from the HF tutorial)

Using FastChat's inference script with an empty system message:
3. Zephyr-7B-sft: 6.42
4. Zephyr-7b-dpo-full: 7.48

Trained with this repo:
5. Zephyr-7b-dpo-beta=0.01: 7.16
In addition, the training statistics when training Zephyr-7B with beta=0.01 are very different from the published ones. I checked against the published DPO training statistics of Zephyr-7b-dpo-full (at epoch 0.84). Below I list the values from our training, with the reported values in parentheses.
The diff in reward/accuracies looks alarming. Any idea what could be the cause @lewtun?
@AlexiaJM @xijiu9 let me know if you have any progress in replicating!
Best, Eric
A reward accuracy of 0.33 doesn't seem reasonable at epoch 0.84. And yet there is still a rewards/margins of 0.36, even though the model is far more likely to rank the rejected response above the chosen one?
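For reference, these two metrics come from DPO's implicit rewards, beta * (policy log-prob minus reference log-prob), so they can genuinely disagree in direction. A plain-Python sketch (function name and toy numbers are mine, not from the repo) shows how accuracy can sit below 0.5 while the margin stays positive:

```python
def dpo_reward_stats(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Batch-level rewards/accuracies and rewards/margins, DPO-style."""
    # Implicit per-example rewards: beta * (policy log-prob - reference log-prob).
    chosen = [beta * (p - r) for p, r in zip(policy_chosen_logps, ref_chosen_logps)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected_logps, ref_rejected_logps)]
    # rewards/accuracies: fraction of pairs where the chosen reward wins.
    accuracy = sum(c > rj for c, rj in zip(chosen, rejected)) / len(chosen)
    # rewards/margins: mean gap between chosen and rejected rewards.
    margin = sum(c - rj for c, rj in zip(chosen, rejected)) / len(chosen)
    return accuracy, margin

# Toy batch: the model is wrong on 2 of 3 pairs, but very right on the third,
# so accuracy is 1/3 while the mean margin is still positive.
acc, margin = dpo_reward_stats([10.0, 0.0, 0.0], [0.0, 1.0, 1.0],
                               [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], beta=1.0)
# acc == 1/3, margin == (10 - 1 - 1) / 3 > 0
```

So a 0.33 accuracy with a 0.36 margin is arithmetically possible (a few pairs with large positive gaps outweighing many small negative ones), though it would still be unusual for a healthy run at epoch 0.84.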
I cannot replicate the DPO results for zephyr.
I use a modified version of config_full.yaml, with the only difference being that I set gradient_accumulation_steps: 4 instead of 2 because I use 4 GPUs. I'm using the exact software versions pinned in setup.py. I had to resume training twice, which is unavoidable on our cluster, but as long as resuming sets the seeds properly, this should not be a problem.
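The batch-size arithmetic behind that change: assuming the reference run used 8 GPUs with gradient_accumulation_steps: 2 (the 8-GPU setup is my inference from the halved GPU count here), doubling the accumulation steps keeps the global batch size identical for any per-device batch size:

```python
def global_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    # Effective batch size per optimizer step under data parallelism.
    return per_device_batch * grad_accum_steps * num_gpus

B = 8  # hypothetical per-device batch size; the equality holds for any B
reference = global_batch_size(B, grad_accum_steps=2, num_gpus=8)
this_run = global_batch_size(B, grad_accum_steps=4, num_gpus=4)
assert reference == this_run  # 16 * B in both cases
```

Note this only matches the optimizer-step batch size; loss averaging across accumulation steps and data-loader sharding can still differ slightly between the two setups.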
Command:

```
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
  --num_processes=4 \
  scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full4.yaml
```
The resulting model is here: https://huggingface.co/AlexiaJM/zephyr-7b-dpo-full-repnew. As you can see, the numbers are slightly off from https://huggingface.co/alignment-handbook/zephyr-7b-dpo-full, but not by much.
These are the MT-Bench results:

```
########## First turn ##########
                               score
model                     turn
zephyr-7b-dpo-full        1    7.81250
zephyr-7b-dpo-full-repnew 1    7.5375

########## Second turn ##########
                               score
model                     turn
zephyr-7b-dpo-full        2    7.322785
zephyr-7b-dpo-full-repnew 2    7.125

########## Average ##########
                              score
model
zephyr-7b-dpo-full            7.569182
zephyr-7b-dpo-full-repnew     7.33125
```