princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

Has anyone reproduced the results in the paper quantitatively? #21

Open binzhwang opened 1 month ago

binzhwang commented 1 month ago

I reproduced the experiment using Llama-3-Instruct-8B. The reproduction strictly followed the official README.md, and the dataset and model were downloaded according to the config.yaml. I then evaluated the reproduced model on AlpacaEval using the provided config. The results are: Length Control (LC): 41.82; win_rate: 38.48; avg_length: 1862

xiamengzhou commented 1 month ago

Hi @binzhwang,

For your reference, we tested the checkpoint princeton-nlp/Llama-3-Instruct-8B-SimPO twice during our development. The first time we got LC: 44.9, win_rate: 39.8, and the second time we got LC: 44.7, win_rate: 40.5, so there seems to be a variance of roughly 1 point on win_rate. Still, the results you get seem to be lower than ours, and here are a few steps you can take to narrow down the issue:

Hope it helps, and happy to help with further questions!

binzhwang commented 4 weeks ago

Thank you for your reply!

As you suggested,

  1. I evaluated the released checkpoint with alpaca_eval three times. The results are as follows: LC: 43.99, win_rate: 40.13; LC: 43.99, win_rate: 40.65; LC: 43.58, win_rate: 40.27. Therefore, it appears that the evaluation process is not the issue.

  2. I reproduced DPO on Llama-3-Instruct-8B using your released code: I changed the SimPOTrainer to the DPOTrainer from trl and left the dataset preprocessing code unchanged, with beta=0.05, lr=1.0e-6, and 1 epoch. The result I reproduced is LC: 34.57, win_rate: 33.94, length: 1954.

  3. I have another question. The code you released may have a bug that results in a doubled <|begin_of_text|> and a doubled <|eot_id|>. You can see this in your preprocessing code combined with the preprocessing code of the DPOTrainer from trl. I removed the duplicated preprocessing, but there is still a mismatch between your released results and my reproduction; a minimal illustration of how the duplication can arise is sketched below. I'd also like to share the training curve with you: the gray line is my SimPO reproduction (Length Control: 41.82; win_rate: 38.48; avg_length: 1862), and the purple line named 'llama-3-8b-instruct-dpo' is DPO.
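Here is a minimal sketch of what I mean (not your training code, just an illustration assuming the meta-llama/Meta-Llama-3-8B-Instruct tokenizer): the duplicate <|begin_of_text|> can arise when a chat-templated string is tokenized again with add_special_tokens=True.

```python
# Illustration only: the chat template already inserts <|begin_of_text|>, and
# re-encoding the rendered string with add_special_tokens=True (the default)
# prepends a second BOS. A similar effect can double <|eot_id|> if an EOS is
# appended on top of the template's own <|eot_id|>.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

# The rendered string already starts with <|begin_of_text|>.
text = tok.apply_chat_template(messages, tokenize=False)
print(text.startswith("<|begin_of_text|>"))

# Tokenizing that string again with special tokens enabled adds another BOS;
# a count of 2 here means the BOS is duplicated, 1 means it is not.
ids = tok(text).input_ids
print(ids.count(tok.bos_token_id))
```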

eugene-yh commented 3 weeks ago

I also tried to reproduce the results for the Llama-3-8B-Instruct setting. My DPO score is almost identical to yours (LC = 34.95, WR = 33.48), around 4-5 points off from the released DPO checkpoint. I guess the reason is that the authors might have used a slightly stronger version of Llama-3-8B-Instruct, as I suggested in #31. However, this is only my speculation, which I hope the authors can help resolve.
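For what it's worth, one way to rule this out would be to record and pin the exact Hub revision of the base model before training. A rough sketch (not from the SimPO repo; assumes huggingface_hub and transformers are installed and the gated meta-llama/Meta-Llama-3-8B-Instruct repo is accessible):

```python
# Hypothetical check: record the commit that "main" currently resolves to,
# then pin it so later Hub updates cannot silently change the starting weights.
import torch
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM

repo = "meta-llama/Meta-Llama-3-8B-Instruct"
sha = HfApi().model_info(repo).sha  # commit hash behind the current "main"
print(sha)

# Loading with an explicit revision makes the base model reproducible.
model = AutoModelForCausalLM.from_pretrained(
    repo, revision=sha, torch_dtype=torch.bfloat16
)
```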

glorgao commented 2 weeks ago

@eugene-yh Hi! I'm also working on reproducing the results on Base and Instruct models. However, I'm encountering some challenges with the evaluation process. Could you help me or share your evaluation setup?

Here's how I evaluate the trained model:

  1. Create a new conda environment and install vllm, alpaca-eval, and other necessary packages.
  2. Move the alpaca_eval configuration file and templates to virtual_env_name/lib/xxx/models_configs (see the sketch after this list for one way to locate that directory).
  3. Run the command evaluate_from_model --model_configs Llama-3-Base-8B-SFT-SimPO-Reproduced.yaml, with model_name in the .yaml file set to ./outputs/llama-3-8b-base-simpo.
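For step 2, this is how I locate that directory without hard-coding the site-packages path (a sketch that assumes the installed alpaca_eval package keeps its model configs in a models_configs folder next to its __init__.py, which is what the path above suggests):

```python
# Locate the models_configs directory of the installed alpaca_eval package,
# so the custom .yaml and prompt template can be copied there.
import pathlib
import alpaca_eval

configs_dir = pathlib.Path(alpaca_eval.__file__).parent / "models_configs"
print(configs_dir)
# Copy Llama-3-Base-8B-SFT-SimPO-Reproduced.yaml and its prompt template here,
# mirroring the layout of the existing model entries.
```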

However, I've encountered the following issues:

  1. vllm does not recognize the setting "torch_dtype: bfloat16" (see the sketch after this list).
  2. alpaca_eval does not accept the parameter "do_sample".
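To check the model itself outside alpaca_eval, here is a minimal standalone sketch of loading the checkpoint directly with vllm in bfloat16. Note that vllm's LLM constructor names the precision argument dtype rather than torch_dtype, and sampling is configured through SamplingParams rather than a do_sample flag (the model path is just my local output directory from the list above):

```python
# Standalone sanity check, independent of alpaca_eval: load the reproduced
# checkpoint with vllm in bfloat16 and generate one completion.
from vllm import LLM, SamplingParams

llm = LLM(model="./outputs/llama-3-8b-base-simpo", dtype="bfloat16")
params = SamplingParams(temperature=0.9, top_p=1.0, max_tokens=512)

outputs = llm.generate(["What is preference optimization?"], params)
print(outputs[0].outputs[0].text)
```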

I'm concerned that modifying these settings might impact the evaluation results. Could you please share your versions of the following packages (as they were not specified by the SimPO authors)?

  1. vllm
  2. alpaca_eval

Alternatively, if you have a different evaluation approach that has worked for you, I would be very grateful if you could share it.

Thanks in advance for your help! @eugene-yh

glorgao commented 2 weeks ago

Hey,

I successfully replicated the paper results of Llama-3-Instruct-8B-SimPO using the released checkpoint and the command: alpaca_eval evaluate_from_model --model_configs 'Llama-3-Instruct-8B-SimPO'. I got LC: 44.65 and WR: 40.53, which suggests my evaluation setup is correct. I removed two settings (torch_dtype: 'bfloat16' and do_sample: True) because they weren't supported by the versions of vllm and alpaca_eval I'm using.

For the models I retrained myself, Llama-3-Instruct-8B-SimPO gave LC: 39.52 and WR: 35.11, and Llama-3-Base-8B-SimPO gave LC: 17.67 and WR: 16.42.

The big difference in results might be due to my software setup, so I'm sharing my package versions just for your information :) req.txt

xiamengzhou commented 4 days ago

@glorgao The degraded performance of your trained model might stem from an inconsistency in BOS tokens between training and evaluation.

yumeng5 commented 10 hours ago

Hi @glorgao @eugene-yh

We have just updated our trainer scripts to hopefully eliminate the issues caused by using different package versions (for example, we noticed that different trl package versions handle BOS tokens differently). Feel free to try out our updated training scripts.

Also, as Mengzhou mentioned above, redundant BOS tokens might be added during evaluation (e.g., in AlpacaEval 2 and Arena-Hard) depending on the specific Llama-3-Instruct tokenizer version used, so it's recommended to check for that as well.

Best, Yu