princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

Mismatch of results #9

Closed AGTSAAA closed 2 days ago

AGTSAAA commented 1 month ago

Hi, I used the following command to run the code:

CUDA_VISIBLE_DEVICES=3,5,6 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run.py training_configs/mistral-7b-base-simpo.yaml

I found that it occupies 70GB per A100 card, even when the batch size is set to 1.

Could you help me with this? Why does it occupy so much GPU memory even with deepspeed_zero3?

yumeng5 commented 1 month ago

Hi,

I suppose you are running on 3 GPUs. The original scripts we used were set up for 8 GPUs, and the per-device batch size can be set to 4 on 80GB GPUs. With fewer GPUs, higher per-GPU memory consumption is expected. Additionally, we recommend using either 2 or 4 GPUs instead of 3, since it is difficult to reach a global batch size of 128 with 3 GPUs (see the sketch below).
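For example, here is a minimal sketch of a 4-GPU launch that keeps the global batch size at 128 (4 GPUs × per-device batch size 4 × gradient accumulation 8). The two trailing overrides assume the launcher accepts alignment-handbook-style command-line overrides of the YAML config; if it does not, set the same two fields directly in training_configs/mistral-7b-base-simpo.yaml instead:

# Sketch: 4 GPUs * per_device_train_batch_size 4 * gradient_accumulation_steps 8 = global batch size 128
CUDA_VISIBLE_DEVICES=0,1,2,3 ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file accelerate_configs/deepspeed_zero3.yaml \
    --num_processes 4 \
    scripts/run.py training_configs/mistral-7b-base-simpo.yaml \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=8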

Best, Yu

AGTSAAA commented 1 month ago

Thank you for your response. I have another question about the mismatch of results.

Why are the results reported in your paper different from those on the Open LLM Leaderboard?

yumeng5 commented 1 month ago

Hi,

This is because our DPO checkpoint under the Mistral-base setup is different from the Zephyr checkpoint.

The Zephyr training recipe was updated ~5 months ago, changing hyperparameters such as beta and num_train_epochs (see here). However, as far as I know, the released model checkpoint was not updated accordingly.

Therefore, we trained our own DPO model under the Mistral-base setup with the updated hyperparameters from the Zephyr training recipe. We found that this new DPO model indeed leads to better instruction-following scores: our DPO checkpoint achieves a 15.1 LC win rate and a 12.5 raw win rate on AlpacaEval 2 (as reported in our paper), while the official Zephyr checkpoint has a 13.2 LC win rate and an 11.0 raw win rate according to the AlpacaEval 2 Leaderboard.

It's normal for instruction-following scores not to always correlate positively with downstream task scores on the Open LLM Leaderboard. We believe this tradeoff is mostly due to different hyperparameter choices, and it applies similarly to all the baselines we compared against. Given our focus on preference optimization, we prioritize instruction-following benchmarks over Open LLM Leaderboard tasks.

Let us know if you have further questions.

Best, Yu

AGTSAAA commented 1 month ago

Thanks for your reply. How did you evaluate on the Open LLM Leaderboard tasks?

Did you submit your model to https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, or did you run https://github.com/EleutherAI/lm-evaluation-harness yourself?

xiamengzhou commented 1 month ago

@AGTSAAA We use the lm-evaluation-harness for evaluation and follow the settings in open-llm-leaderboard!

AGTSAAA commented 1 month ago

@xiamengzhou Thanks for your reply!

Could you please tell me which version of lm-evaluation-harness you used?

xiamengzhou commented 1 month ago

We used this version, as suggested by the Open LLM Leaderboard.
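For anyone reproducing this, a rough sketch of how a single leaderboard task could be run with that older harness version (the main.py entry point). The checkpoint path is a placeholder, and the exact flags and few-shot settings should be double-checked against the pinned commit and the leaderboard's About page:

# Sketch of a leaderboard-style run with the older harness (main.py entry point);
# ARC-Challenge uses 25-shot in the leaderboard settings. <your_checkpoint> is a placeholder.
python main.py \
    --model hf-causal \
    --model_args pretrained=<your_checkpoint> \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 1 \
    --output_path results/arc_challenge.json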