wlhgtc opened this issue 8 months ago
Perhaps related: I compared two full DPO-trained checkpoints, HuggingFaceH4/zephyr-7b-beta and alignment-handbook/zephyr-7b-dpo-full. Their MT-bench results differ as well: HuggingFaceH4/zephyr-7b-beta matches the results in the paper, while alignment-handbook/zephyr-7b-dpo-full does not. See https://github.com/huggingface/alignment-handbook/issues/44
Could you please clarify where the 7.43 MT-bench score for LoRA-DPO was reported? As far as I know, the models in the technical report didn't use LoRA. The paper mentions:
> We did not experiment with parameter-efficient techniques such as LoRA (Hu et al., 2021), but expect similar results to hold with these methods.
The MT-bench score of full DPO is 7.34 (Table 1).
Yeah, it appears we obtained the same results regardless of whether we used -full or -lora. I also found a model named Zephyr-7b-α in zephyr-7b-beta, which has an MT-Bench score of 6.88, nearly identical to ours. Perhaps you could evaluate zephyr-7b-sft-full and check if there is any difference at the SFT stage.
@liutianlin0121 @wlhgtc ,
I ran the official model weights: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
On MT-Bench, this is the result based on the current GPT-4 API call (I named the model zep-hf).
```
########## First turn ##########
                 score
model       turn
zep-hf      1    7.10625
########## Second turn ##########
                 score
model       turn
zep-hf      2    6.45
########## Average ##########
                 score
model
zep-hf           6.778125
```
Seems like the GPT-4 API is pretty unstable now?
Edit: To anyone that wants to replicate MT-Bench, just note that it will cost you around $5 USD to run one whole evaluation for a single model.
Thanks for all your questions and detailed analysis; there are a number of different things to address here.
LoRA training:
The official zephyr-7b-beta model used full training. We provide LoRA training configs as an example for lower-resource machines, but we do not expect the LoRA runs to achieve parity with the full finetune.
I would recommend targeting all linear layers with LoRA rather than the subset we have chosen in the config files. In addition, there is probably a better combination of hyperparameters, as these were copied from the full training config. A rule of thumb is to use a 100x learning rate when using LoRA; I do not know if this is valid for the DPO step.
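As a sketch of this advice, a LoRA config targeting all of Mistral-7B's linear layers might look like the fragment below. This is an assumption-laden illustration, not a tested recipe: the field names follow the style of the handbook's own LoRA configs, the `lora_r`/`lora_alpha`/`lora_dropout` values are illustrative, and the 100x learning-rate rule of thumb is explicitly unverified for DPO.

```yaml
# Illustrative only -- values are assumptions, not a tested recipe.
use_peft: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:   # all linear layers in Mistral-7B, not just a subset
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
learning_rate: 5.0e-5  # ~100x the full DPO learning rate of 5.0e-7, per the rule of thumb (unverified for DPO)
```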
MT Bench scores:
There appears to be a regression somewhere when comparing the original zephyr-7b-beta model and the models trained in the handbook. There are a number of potential sources of this regression which we have not managed to pin down yet. Here are our ideas so far:
- A regression in TRL's SFT trainer
- A regression in TRL's DPO trainer
- The GPT-4 model used in MT Bench has changed since our last evals
- A mismatch between our internal codebase (h4) and the alignment handbook
We still need to investigate this, but we will keep you up to date as we do so. This may take several weeks.
Thanks all! This is very helpful.
In light of @timothylimyl's result (running the original zephyr-7b-beta on MT-bench), it seems that drift in the GPT-4 API is a contributing factor. About 2 weeks back, I ran the same MT-bench eval as @timothylimyl did, and at that time the MT-bench score of the original zephyr-7b-beta was 7.37, which closely matched the score reported in the paper. A few days ago, I ran the handbook version of zephyr-7b-beta, and its MT-bench score was 7.09. See https://github.com/huggingface/alignment-handbook/issues/44
> There appears to be a regression somewhere when comparing the original zephyr-7b-beta model and the models trained in the handbook.
May I ask if there is any other evidence (in addition to MT-bench scores) that suggests the performance regression?
We identified the regression using MT-Bench scores as well. We are rerunning the evals and other experiments internally to try to get to the root cause.
Thanks for your quick response.
My understanding is: the model zephyr-7b-dpo-full is meant to be identical to zephyr-7b-beta, and both are expected to score approximately 7.34 on MT-Bench. However, zephyr-7b-dpo-lora is only a sample, and its results might fall short of 7.43 due to the influence of hyperparameters or other factors.
Did I get it right? @edbeeching
Hi @wlhgtc,
In fact, zephyr-7b-beta was trained using an internal codebase. zephyr-7b-dpo-full was trained using code from this repo, with the same parameters as the internal codebase. This repo contains a subset of the code from our internal codebase.
zephyr-7b-dpo-lora has hyperparameters that are the same as the full finetune, but includes LoRA with peft adapters.
There is no guarantee that zephyr-7b-beta and zephyr-7b-dpo-full have the same performance on MT-Bench. We aim for them to be the same, but there may be a regression somewhere.
Hello everyone, thank you for raising this issue - we hope to get to the bottom of the discrepancy soon!
In the meantime, I double-checked whether the source of the diff could be from GPT-4 in MT-Bench and can confirm that it's still pretty stable. To be safe, I reran evals for openchat-3.5 and vicuna-7b-v1.5 so we can compare to the official leaderboard (new evals have a date suffix of 2023-11-23):
```
########## Average ##########
                              score
model
openchat-3.5_2023-11-23    7.825000
openchat-3.5                   7.81
zephyr-7b-beta_2023-11-23  7.358491
zephyr-7b-beta                 7.34
vicuna-7b-v1.5                 6.17
vicuna-7b-v1.5_2023-11-23  6.034375
```
With this eliminated, the next thing is to check for the other points in @edbeeching's post above (regressions or issues with porting the code). We'll report back here once we have some new information!
@lewtun Thanks so much! Just to make sure that I understand correctly: I suppose both zephyr-7b-beta_2023-11-23 and zephyr-7b-beta in the table above are based on the original model (HuggingFaceH4/zephyr-7b-beta), and not the handbook model (alignment-handbook/zephyr-7b-dpo-full), right?
Yes, that's correct. Once we trace the source of the discrepancy we'll update all the models / configs where needed.
Please correct me if I am misunderstanding this. Do you mean that the regression in results is caused by a difference between Hugging Face's release and the official paper's model? The signal for this, to me, is running the official zephyr-beta release from HF on MT-Bench and observing the result now.
Another signal of regression due to porting of code (your internal codebase to the alignment handbook) is that I could not train the model to get anywhere close. I went ahead and tried using the official sft-full model, ran DPO on it as per the recipe, and got this result:
```
########## First turn ##########
                  score
model        turn
zep-own      1    6.6875
########## Second turn ##########
                  score
model        turn
zep-own      2    5.8625
########## Average ##########
                  score
model
zep-own           6.275
```
You can find my replication model for sft-full to dpo-full: https://huggingface.co/timlim123/zephyr-7b-dpo-full
Lastly, @lewtun, I am confused by the evaluation loss report provided. The loss reported does not match any of the loss values in the training table.
Hello @timothylimyl, can you share which commit of the FastChat repo you are using to compute the MT-Bench scores?
I just noticed there is a bug with the chat templates introduced 6h ago (https://github.com/lm-sys/FastChat/pull/2725) which I've now proposed a fix for in https://github.com/lm-sys/FastChat/pull/2727
It's possible that your generations are using the wrong chat template (Hermes2 instead of Zephyr) and that's why your scores are so much lower. A quick way to test this is to see what happens if you try to chat with your model via their CLI:
```shell
python3 -m fastchat.serve.cli --model-path timlim123/zephyr-7b-dpo-full --debug
```
> Please correct me if I am misunderstanding this. Do you mean that the regression in results is caused by a different release by Hugging Face compared to the official paper?
Yes, it seems there are two remaining sources of discrepancy:
- either a regression in the TRL trainers for SFT/DPO occurred in the time between our paper & the handbook release
- a bug was introduced when porting our code from the internal codebase to the handbook
We're trying to pin down the root cause and will report back here when we have some results.
> Lastly, @lewtun, I am confused by the evaluation loss report provided. The loss reported does not match any of the loss values in the training table.
Can you please clarify exactly what you're comparing against?
Hi @lewtun,
The FastChat repo I used is the current master (as of 23/11/2023):
```
name = "fschat"
version = "0.2.32"
```
When running the dpo-full model that I trained under FastChat's inference debug mode, the responses are good.
Regarding the loss discrepancy I mentioned: the loss value posted cannot be found anywhere in the table of results (link: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta):
@lewtun An add-on query: it seems that the system prompts are left blank? Is this done internally too?
@timothylimyl
> In regards to the loss discrepancy I am mentioning, the loss value posted cannot be found anywhere in the table of results
Do you mean that the loss value 0.7496 reported in the "Training and evaluation" section does not occur in the table in the "Training results" section?
If so, this is because 0.7496 is the final loss after training, whereas the table in "Training results" reports an evaluation every 100 steps. The total number of steps is generally not divisible by 100, so the last step's loss does not appear in the table.
Take your experiment https://huggingface.co/timlim123/zephyr-7b-dpo-full as an example. The final loss reported just below "zephyr-7b-dpo-full" is 0.7337, and it is not in the table. However, I checked your tensorboard record, and 0.7337 shows up there. Note that there are 5811 steps in total, which is not divisible by 100. See the screenshot attached, where I circled the final loss and the time step where this loss occurred.
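A quick arithmetic check of this explanation (the step count 5811 and the 100-step interval come from the run above; the snippet itself is just an illustration):

```python
# The "Training results" table logs an evaluation every 100 steps, so the
# final step's loss never appears in it when the total step count is not
# a multiple of 100.
total_steps = 5811      # from the tensorboard record
eval_interval = 100

logged = list(range(eval_interval, total_steps + 1, eval_interval))
print(logged[-1])               # last step shown in the table: 5800
print(total_steps in logged)    # False: step 5811 is never tabulated
```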
Hope this helps!
@liutianlin0121 Thanks for clearing that up! I wrongly assumed that the loss reported would be the lowest validation loss; I should have checked the logs.
@lewtun Is the official model deployed on Hugging Face then:
a) the final checkpoint/weights
b) the lowest training-loss checkpoint
c) the lowest validation-loss checkpoint
@lewtun It seems like on the new pull of FastChat (27/11/2023), the results are further degraded for the official zephyr-beta model:
```
########## First turn ##########
                        score
model              turn
zep-hf-fixed       1    7.1375
########## Second turn ##########
                        score
model              turn
zep-hf-fixed       2    6.175
########## Average ##########
                        score
model
zep-hf-fixed            6.65625
```
> @lewtun Is the official model deployed on Hugging Face then: a) the final checkpoint/weight b) the lowest training loss checkpoint c) the lowest validation loss checkpoint
Lewis may want to correct me, but I believe it's a) the final checkpoint. The reason is that the training loss and validation loss are not reliable indicators of the DPO-trained model's downstream performance, as mentioned in the paper:
> In the process of training ZEPHYR-7B we observed that after one epoch of DPO training, the model would strongly overfit, as indicated by perfect training set accuracies in Figure 3. Surprisingly, this did not harm downstream performance on MT-Bench and AlpacaEval; as shown in Figure 3, the strongest model was obtained with one epoch of SFT followed by three epochs of DPO.
Given this, selecting checkpoints based on train or validation loss seems no more reliable than just using the last checkpoint.
> @lewtun It seems like on the new pull of FastChat (27/11/2023), the results are further degraded for the official zephyr-beta model
Please ignore this. I realised that FastChat's MT-Bench is not smart enough to detect the correct model and uses the name instead, so I just needed to name the model zephyr instead of zep:
```
########## First turn ##########
                 score
model       turn
zephyr      1    7.68125
########## Second turn ##########
                 score
model       turn
zephyr      2    6.975
########## Average ##########
                 score
model
zephyr           7.328125
```
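The renaming pitfall above can be sketched with a toy illustration. This is not FastChat's actual code: `pick_template` and the template list are invented for illustration; the point is only that substring matching on the model name decides which chat template is used.

```python
# Hypothetical illustration of name-based template lookup (NOT FastChat's
# real code): the evaluator substring-matches the model name, so "zep-hf"
# misses the Zephyr template and falls back to a default one.
def pick_template(model_name: str) -> str:
    known_templates = ["zephyr", "vicuna", "openchat"]  # invented list
    for key in known_templates:
        if key in model_name.lower():
            return key
    return "default"

print(pick_template("zep-hf"))   # falls back to "default" (wrong template)
print(pick_template("zephyr"))   # resolves to "zephyr" (correct template)
```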
Update: following the current (29/11/2023) alignment-handbook repo, I can confirm that I cannot replicate the sft-full to dpo-full results:
```
########## First turn ##########
                          score
model                turn
zephyr-beta-own      1    7.25
########## Second turn ##########
                          score
model                turn
zephyr-beta-own      2    6.60625
########## Average ##########
                          score
model
zephyr-beta-own           6.928125
```
@timothylimyl Thanks for verifying this! Really curious about what causes the regression...
@edbeeching @lewtun May I ask if there are updates regarding the regression? Even a partial solution or some intuitions would be tremendously helpful. Thanks!
Hi @liutianlin0121 , sorry for the lack of updates. I have been cautiously working through PRs on our internal codebase to identify the root cause. I can confirm that I am able to replicate the zephyr beta results internally with an older version of our internal codebase and this morning I have identified the PR that leads to the regression. I now need to find the specific changes within that PR.
Hi @edbeeching ! Thanks a lot! I went ahead and did some debugging myself. I finished a round of training of zephyr-7b using the handbook recipe + 2 changes, and I'm happy to report that I can reproduce the MT-bench scores now.
My zephyr model on huggingface hub: link
The MT-bench evaluation on colab: link. The MT-bench score is 7.390625, matching the original result reported in the paper.
In the above figure, "alignment-handbook-zephyr-7b-dpo-full" is the Zephyr model provided here. The "zephyr-7b-dpo-full-debug-regression" is the model from my re-run here.
The two small changes I made are the following:
1. Instead of using a global batch size of 64, I used a global batch size of 32. A global batch size of 32 is consistent with the number reported in the paper and was also used for the official model, whereas 64 was used to train the handbook model, so I switched back to 32.
2. Instead of using the SFT checkpoint alignment-handbook/zephyr-7b-sft-full, I used the SFT checkpoint HuggingFaceH4/mistral-7b-sft-beta.
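The batch-size change can be sanity-checked with simple arithmetic (the 8-GPU count is an assumption taken from the config's own comment):

```python
# Global batch size = per-device batch size * number of GPUs * gradient
# accumulation steps. The values below reproduce the paper's setting of 32.
per_device_train_batch_size = 4
num_gpus = 8                      # assumed; matches the config's comment
gradient_accumulation_steps = 1

global_batch = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(global_batch)  # 32
```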
After these two edits, the full DPO training recipe is the following:
```yaml
# Model arguments
model_name_or_path: HuggingFaceH4/mistral-7b-sft-beta

# Data training arguments
# For definitions, see: src/h4/training/config.py
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 12

# DPOTrainer arguments
bf16: true
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 1
gradient_checkpointing: true
hub_model_id: zephyr-7b-dpo-full-debug-regression
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 3
optim: rmsprop
output_dir: data/zephyr-7b-dpo-full-debug-regression
per_device_train_batch_size: 4 # With 8 GPUs, the global batch size is 32.
per_device_eval_batch_size: 4
push_to_hub: true
save_strategy: "no"
save_total_limit: null
seed: 42
warmup_ratio: 0.1
```
See the tensorboard record for more details.
For now, I am not sure if both changes are needed. But I suspect that the SFT checkpoint is the primary (if not the only) factor.
@liutianlin0121 Did you run MT-Bench on HuggingFaceH4/mistral-7b-sft-beta? I am curious whether the SFT model itself already achieves a high result on MT-Bench.
That's a good idea. I've not tried that yet.
DPO does help improve performance on MT-Bench, but I can't achieve a score of 7.43. Is there any difference between the model described in your paper and the model available on your homepage? Or could it be the difference between full finetuning and LoRA?
By the way, I truly love the YAML-style argument parser; it's clear and elegant! @edbeeching @lewtun