wlhgtc opened this issue 8 months ago
Perhaps related: I compared two full DPO-trained checkpoints, HuggingFaceH4/zephyr-7b-beta and alignment-handbook/zephyr-7b-dpo-full. Their MT-bench results differ as well: HuggingFaceH4/zephyr-7b-beta matches the results in the paper, while alignment-handbook/zephyr-7b-dpo-full does not. See https://github.com/huggingface/alignment-handbook/issues/44
Could you please clarify where the 7.43 MT-bench score for LoRA-DPO was reported? As far as I know, the models in the technical report didn't use LoRA. The paper mentions:
> We did not experiment with parameter-efficient techniques such as LoRA (Hu et al., 2021), but expect similar results to hold with these methods.
The MT-bench score of full DPO is 7.34 (Table 1).
Yeah, it appears we obtained the same results regardless of whether we used -full or -lora. I also found a model named Zephyr-7b-α in zephyr-7b-beta, which has an MT-Bench score of 6.88, nearly identical to ours. Perhaps you could evaluate zephyr-7b-sft-full and check if there is any difference at the SFT stage.
@liutianlin0121 @wlhgtc ,
I ran the official model weights: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
On MT-Bench, this is the result based on the current GPT-4 API call (I named the model zep-hf).
```
########## First turn ##########
                 score
model       turn
zep-hf      1    7.10625
########## Second turn ##########
                 score
model       turn
zep-hf      2    6.45
########## Average ##########
                 score
model
zep-hf           6.778125
```
Seems like the GPT-4 API is pretty unstable now?
Edit: To anyone that wants to replicate MT-Bench, just note that it will cost you around $5 USD to run one whole evaluation for a single model.
Thanks for all your questions and detailed analysis; there are a number of different things to address here.
LoRA training:
The official zephyr-7b-beta model used full training. We provide LoRA training configs as an example for lower-resource machines, but we do not expect the LoRA runs to achieve parity with the full finetune.
I would recommend targeting all linear layers with LoRA rather than the subset we have chosen in the config files. In addition, there is probably a better combination of hyperparameters, as these were copied from the full training config. A rule of thumb is to use a 100x learning rate when using LoRA; I do not know if this is valid for the DPO step.
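As a sketch of this advice, a LoRA config targeting all of Mistral-7B's linear layers might look like the fragment below. This is an assumption-laden illustration, not a tested recipe: the field names follow the style of the handbook's own LoRA configs, the `lora_r`/`lora_alpha`/`lora_dropout` values are illustrative, and the 100x learning-rate rule of thumb is explicitly unverified for DPO.

```yaml
# Illustrative only -- values are assumptions, not a tested recipe.
use_peft: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:   # all linear layers in Mistral-7B, not just a subset
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
learning_rate: 5.0e-5  # ~100x the full DPO learning rate of 5.0e-7, per the rule of thumb (unverified for DPO)
```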
MT Bench scores:
There appears to be a regression somewhere when comparing the original zephyr-7b-beta model and the models trained in the handbook. There are a number of potential sources of this regression which we have not managed to pin down yet. Here are our ideas so far:
- A regression in TRL's SFT trainer
- A regression in TRL's DPO trainer
- The GPT-4 model used in MT Bench has changed since our last evals
- A mismatch between our internal codebase (h4) and the alignment handbook
We still need to investigate this, but we will keep you up to date as we do so. This may take several weeks.
Thanks all! This is very helpful.
In light of @timothylimyl's result (running the original zephyr-7b-beta on MT-bench), it seems that drift in the GPT-4 API is a contributing factor. About 2 weeks back, I ran the same MT-bench eval as @timothylimyl did, and at that time the MT-bench score of the original zephyr-7b-beta was 7.37, which closely matched the score reported in the paper. A few days ago, I ran the handbook version of zephyr-7b-beta, and its MT-bench score was 7.09. See https://github.com/huggingface/alignment-handbook/issues/44
> There appears to be a regression somewhere when comparing the original zephyr-7b-beta model and the models trained in the handbook.
May I ask if there is any other evidence (in addition to MT-bench scores) that suggests the performance regression?
We identified the regression using MT-Bench scores as well. We are rerunning the evals and other experiments internally to try to get to the root cause.
Thanks for your quick response.
My understanding is: the model zephyr-7b-dpo-full is meant to be identical to zephyr-7b-beta, and both are expected to score approximately 7.34 on MT-Bench. However, zephyr-7b-dpo-lora is only a sample, and its results might fall short of 7.43 due to the influence of hyperparameters or other factors.
Did I get it right? @edbeeching
Hi @wlhgtc,
In fact, zephyr-7b-beta was trained using an internal codebase. zephyr-7b-dpo-full was trained using code from this repo, with the same parameters as the internal codebase. This repo contains a subset of the code from our internal codebase.
zephyr-7b-dpo-lora has hyperparameters that are the same as the full finetune, but includes LoRA with peft adapters.
There is no guarantee that zephyr-7b-beta and zephyr-7b-dpo-full have the same performance on MT-Bench. We aim for them to be the same, but there may be a regression somewhere.
Hello everyone, thank you for raising this issue - we hope to get to the bottom of the discrepancy soon!
In the meantime, I double-checked whether the source of the diff could be from GPT-4 in MT-Bench and can confirm that it's still pretty stable. To be safe, I reran evals for openchat-3.5 and vicuna-7b-v1.5 so we can compare to the official leaderboard (new evals have a date suffix of 2023-11-23):
```
########## Average ##########
                              score
model
openchat-3.5_2023-11-23    7.825000
openchat-3.5                   7.81
zephyr-7b-beta_2023-11-23  7.358491
zephyr-7b-beta                 7.34
vicuna-7b-v1.5                 6.17
vicuna-7b-v1.5_2023-11-23  6.034375
```
With this eliminated, the next thing is to check for the other points in @edbeeching's post above (regressions or issues with porting the code). We'll report back here once we have some new information!
@lewtun Thanks so much! Just to make sure that I understand correctly: I suppose both zephyr-7b-beta_2023-11-23 and zephyr-7b-beta in the table above are based on the original model (HuggingFaceH4/zephyr-7b-beta), and not the handbook model (alignment-handbook/zephyr-7b-dpo-full), right?
Yes, that's correct. Once we trace the source of the discrepancy we'll update all the models / configs where needed.
Please correct me if I am misunderstanding this. Do you mean that the regression in results is caused by a difference between Hugging Face's release and the official paper's model? The signal for this, to me, is running the official zephyr-beta release from HF on MT-Bench and observing the result now.
Another signal of regression due to porting of code (your internal codebase to the alignment handbook) is that I could not train the model to get anywhere close. I went ahead and tried using the official sft-full model, ran DPO on it as per the recipe, and got this result:
```
########## First turn ##########
                  score
model        turn
zep-own      1    6.6875
########## Second turn ##########
                  score
model        turn
zep-own      2    5.8625
########## Average ##########
                  score
model
zep-own           6.275
```
You can find my replication model for sft-full to dpo-full: https://huggingface.co/timlim123/zephyr-7b-dpo-full
Lastly, @lewtun, I am confused by the evaluation loss report provided. The loss reported does not match any of the loss values in the training table.
Hello @timothylimyl, can you share which commit of the FastChat repo you are using to compute the MT-Bench scores?
I just noticed there is a bug with the chat templates introduced 6h ago (https://github.com/lm-sys/FastChat/pull/2725) which I've now proposed a fix for in https://github.com/lm-sys/FastChat/pull/2727
It's possible that your generations are using the wrong chat template (Hermes2 instead of Zephyr) and that's why your scores are so much lower. A quick way to test this is to see what happens if you try to chat with your model via their CLI:
```shell
python3 -m fastchat.serve.cli --model-path timlim123/zephyr-7b-dpo-full --debug
```
> Please correct me if I am misunderstanding this. Do you mean that the regression in results is caused by a different release by Hugging Face compared to the official paper?
Yes, it seems there are two remaining sources of discrepancy:
- either a regression in the TRL trainers for SFT/DPO occurred in the time between our paper & the handbook release
- a bug was introduced when porting our code from the internal codebase to the handbook
We're trying to pin down the root cause and will report back here when we have some results.
> Lastly, @lewtun, I am confused by the evaluation loss report provided. The loss reported does not match any of the loss values in the training table.
Can you please clarify exactly what you're comparing against?
Hi @lewtun,
The FastChat repo I used is the current master (as of 23/11/2023):
```
name = "fschat"
version = "0.2.32"
```
When running the dpo-full model that I trained under FastChat's inference debug mode, the responses are good.
Regarding the loss discrepancy I mentioned: the loss value posted cannot be found anywhere in the table of results (link: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta):
@lewtun An add-on query: it seems that the system prompts are left blank? Is this done internally too?
@timothylimyl
> In regards to the loss discrepancy I am mentioning, the loss value posted cannot be found anywhere in the table of results
Do you mean that the loss value 0.7496 reported in the "Training and evaluation" section does not occur in the table in the "Training results" section?
If so, this is because 0.7496 is the final loss after training, whereas the table in "Training results" reports an evaluation every 100 steps. The total number of steps is generally not divisible by 100, so the last step's loss does not appear in the table.
Take your experiment https://huggingface.co/timlim123/zephyr-7b-dpo-full as an example. The final loss reported just below "zephyr-7b-dpo-full" is 0.7337, and it is not in the table. However, I checked your tensorboard record, and 0.7337 shows up there. Note that there are 5811 steps in total, which is not divisible by 100. See the screenshot attached, where I circled the final loss and the time step where this loss occurred.
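A quick arithmetic check of this explanation (the step count 5811 and the 100-step interval come from the run above; the snippet itself is just an illustration):

```python
# The "Training results" table logs an evaluation every 100 steps, so the
# final step's loss never appears in it when the total step count is not
# a multiple of 100.
total_steps = 5811      # from the tensorboard record
eval_interval = 100

logged = list(range(eval_interval, total_steps + 1, eval_interval))
print(logged[-1])               # last step shown in the table: 5800
print(total_steps in logged)    # False: step 5811 is never tabulated
```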
Hope this helps!
@liutianlin0121 Thanks for clearing that up! I wrongly assumed that the loss reported would be the lowest validation loss; I should have checked the logs.
@lewtun Is the official model deployed on Hugging Face then:
a) the final checkpoint/weights
b) the lowest training-loss checkpoint
c) the lowest validation-loss checkpoint
@lewtun It seems like on the new pull of FastChat (27/11/2023), the results are further degraded for the official zephyr-beta model:
```
########## First turn ##########
                        score
model              turn
zep-hf-fixed       1    7.1375
########## Second turn ##########
                        score
model              turn
zep-hf-fixed       2    6.175
########## Average ##########
                        score
model
zep-hf-fixed            6.65625
```
> @lewtun Is the official model deployed on Hugging Face then: a) the final checkpoint/weight b) the lowest training loss checkpoint c) the lowest validation loss checkpoint
Lewis may want to correct me, but I believe it's a) the final checkpoint. The reason is that the training loss and validation loss are not reliable indicators of the DPO-trained model's downstream performance, as mentioned in the paper:
> In the process of training ZEPHYR-7B we observed that after one epoch of DPO training, the model would strongly overfit, as indicated by perfect training set accuracies in Figure 3. Surprisingly, this did not harm downstream performance on MT-Bench and AlpacaEval; as shown in Figure 3, the strongest model was obtained with one epoch of SFT followed by three epochs of DPO.
Given this, selecting checkpoints based on train or validation loss seems no more reliable than just using the last checkpoint.
> @lewtun It seems like on the new pull of FastChat (27/11/2023), the results are further degraded for the official zephyr-beta model
Please ignore this. I realised that FastChat's MT-Bench is not smart enough to detect the correct model and uses the name instead, so I just needed to name the model zephyr instead of zep:
```
########## First turn ##########
                 score
model       turn
zephyr      1    7.68125
########## Second turn ##########
                 score
model       turn
zephyr      2    6.975
########## Average ##########
                 score
model
zephyr           7.328125
```
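The renaming pitfall above can be sketched with a toy illustration. This is not FastChat's actual code: `pick_template` and the template list are invented for illustration; the point is only that substring matching on the model name decides which chat template is used.

```python
# Hypothetical illustration of name-based template lookup (NOT FastChat's
# real code): the evaluator substring-matches the model name, so "zep-hf"
# misses the Zephyr template and falls back to a default one.
def pick_template(model_name: str) -> str:
    known_templates = ["zephyr", "vicuna", "openchat"]  # invented list
    for key in known_templates:
        if key in model_name.lower():
            return key
    return "default"

print(pick_template("zep-hf"))   # falls back to "default" (wrong template)
print(pick_template("zephyr"))   # resolves to "zephyr" (correct template)
```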
Update: following the current (29/11/2023) alignment-handbook repo, I can confirm that I cannot replicate the sft-full to dpo-full results:
```
########## First turn ##########
                          score
model                turn
zephyr-beta-own      1    7.25
########## Second turn ##########
                          score
model                turn
zephyr-beta-own      2    6.60625
########## Average ##########
                          score
model
zephyr-beta-own           6.928125
```
@timothylimyl Thanks for verifying this! Really curious about what causes the regression...
@edbeeching @lewtun May I ask if there are updates regarding the regression? Even a partial solution or some intuitions would be tremendously helpful. Thanks!
Hi @liutianlin0121 , sorry for the lack of updates. I have been cautiously working through PRs on our internal codebase to identify the root cause. I can confirm that I am able to replicate the zephyr beta results internally with an older version of our internal codebase and this morning I have identified the PR that leads to the regression. I now need to find the specific changes within that PR.
Hi @edbeeching ! Thanks a lot! I went ahead and did some debugging myself. I finished a round of training of zephyr-7b using the handbook recipe + 2 changes, and I'm happy to report that I can reproduce the MT-bench scores now.
My zephyr model on huggingface hub: link
The MT-bench evaluation on colab: link. The MT-bench score is 7.390625, matching the original result reported in the paper.
In the above figure, "alignment-handbook-zephyr-7b-dpo-full" is the Zephyr model provided here. The "zephyr-7b-dpo-full-debug-regression" is the model from my re-run here.
The two small changes I made are the following:
1. Instead of using a global batch size of 64, I used a global batch size of 32. A global batch size of 32 is consistent with the number reported in the paper and was also used for the official model, whereas 64 was used to train the handbook model, so I switched back to 32.
2. Instead of using the SFT checkpoint alignment-handbook/zephyr-7b-sft-full, I used the SFT checkpoint HuggingFaceH4/mistral-7b-sft-beta.
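The batch-size change can be sanity-checked with simple arithmetic (the 8-GPU count is an assumption taken from the config's own comment):

```python
# Global batch size = per-device batch size * number of GPUs * gradient
# accumulation steps. The values below reproduce the paper's setting of 32.
per_device_train_batch_size = 4
num_gpus = 8                      # assumed; matches the config's comment
gradient_accumulation_steps = 1

global_batch = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(global_batch)  # 32
```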
After these two edits, the full DPO training recipe is the following:
```yaml
# Model arguments
model_name_or_path: HuggingFaceH4/mistral-7b-sft-beta

# Data training arguments
# For definitions, see: src/h4/training/config.py
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 12

# DPOTrainer arguments
bf16: true
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 1
gradient_checkpointing: true
hub_model_id: zephyr-7b-dpo-full-debug-regression
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 3
optim: rmsprop
output_dir: data/zephyr-7b-dpo-full-debug-regression
per_device_train_batch_size: 4 # With 8 GPUs, the global batch size is 32.
per_device_eval_batch_size: 4
push_to_hub: true
save_strategy: "no"
save_total_limit: null
seed: 42
warmup_ratio: 0.1
```
See the tensorboard record for more details.
For now, I am not sure if both changes are needed. But I suspect that the SFT checkpoint is the primary (if not the only) factor.
@liutianlin0121 Did you run MT-Bench on HuggingFaceH4/mistral-7b-sft-beta? I am curious whether the SFT model itself already achieves a high result on MT-Bench.
That's a good idea. I've not tried that yet.
DPO does help improve performance on MT-Bench, but I can't achieve a score of 7.43. Is there any difference between the model described in your paper and the model available on your homepage? Or could it be the difference between full finetuning and LoRA?
By the way, I truly love the YAML-style argument parser; it's clear and elegant! @edbeeching @lewtun