Closed. AGTSAAA closed this issue 2 days ago.
Hi,
I suppose you are using 3 GPUs to run the code. The original scripts we used were set up for 8 GPUs, and the per-device batch size can be set to 4 on 80GB GPUs. With fewer GPUs, higher per-GPU memory consumption is expected. Additionally, we recommend using 2 or 4 GPUs rather than 3, because a global batch size of 128 is difficult to achieve with 3 GPUs (128 is not divisible by 3).
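As a quick sanity check on the divisibility point above, the global batch size is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. The specific accumulation values below are illustrative, not taken from the repo's configs:

```python
def global_batch_size(per_device_batch_size: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Effective batch size seen by the optimizer per update step."""
    return per_device_batch_size * num_gpus * grad_accum_steps

# One way (hypothetical) to reach 128 on 8 GPUs: batch 4, accumulation 4
assert global_batch_size(4, 8, 4) == 128

# With 3 GPUs there is no integer per-device batch / accumulation
# combination that lands exactly on 128, since 128 is not divisible by 3
assert 128 % 3 != 0
```

With 2 or 4 GPUs, 128 factors cleanly (e.g. 4 GPUs x batch 4 x 8 accumulation steps), which is why those counts are easier to configure.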
Best, Yu
Thank you for your response. I have another question about the mismatch of results.
Why are the results reported in your paper different from the Open LLM Leaderboard?
Hi,
This is because our DPO checkpoint under the Mistral-base setup is different from the Zephyr checkpoint.
There was an update to the Zephyr training recipe ~5 months ago that changed hyperparameters such as beta and num_train_epochs (see here). However, the model checkpoint was not updated, as far as I know.
Therefore, we trained our own DPO model under the Mistral-base setup with the updated hyperparameters of the Zephyr training recipe. We found that this new DPO model indeed led to better instruction-following scores (our DPO checkpoint has a 15.1 LC win rate and 12.5 raw win rate on AlpacaEval 2, as reported in our paper, while the official Zephyr checkpoint has a 13.2 LC win rate and 11.0 raw win rate according to the AlpacaEval 2 Leaderboard).
It's normal for instruction-following scores not to always correlate positively with downstream task scores on the Open LLM Leaderboard. We believe this tradeoff is mostly due to different hyperparameter choices, and it applies similarly to all the baselines we compared against. Given our focus on preference optimization, we prioritize instruction-following benchmarks over Open LLM Leaderboard tasks.
Let us know if you have further questions.
Best, Yu
Thanks for your reply. How did you evaluate on the Open LLM Leaderboard?
Did you submit your model to https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard or
use https://github.com/EleutherAI/lm-evaluation-harness by yourself?
@AGTSAAA We use the lm-evaluation-harness for evaluation and follow the settings in open-llm-leaderboard!
@xiamengzhou Thanks for your reply!
Could you please tell me which version of lm-evaluation-harness you used?
We used this version as the open-llm-leaderboard suggested.
Hi, I use the following command to run the code
CUDA_VISIBLE_DEVICES=3,5,6 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run.py training_configs/mistral-7b-base-simpo.yaml
I found that it occupies 70GB per A100 card, even when the batch size is set to 1.
Could you help me with this? Why does it occupy so much GPU memory even with deepspeed_zero3?