lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Variation in MT-bench score #3018

Open bidyapati-p opened 6 months ago

bidyapati-p commented 6 months ago

We ran MT-bench multiple times with llama2-70b-chat. With the generated answers held fixed (step 1, run once), the GPT-4 scoring (step 2, run multiple times) still varies; in our experiment the final score varied by 0.16 over 5 runs.

If we also include the answer generation step, the variation might be even larger. There are two sources of variation:

  1. Answer generation uses temperature 0.7 for the writing and roleplay categories, which can lead to variation in the text generated by the model under test (see the sketch after this list).
  2. Scoring by GPT-4 varies every time we rerun it, even on the same generated answers.
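
For reference, the per-category generation temperatures look roughly like the sketch below, based on the `temperature_config` dictionary in `fastchat/llm_judge/common.py` at the time of writing (check the repo for the authoritative values):

```python
# Sketch of per-category sampling temperatures used when generating model answers
# (assumed from fastchat/llm_judge/common.py; values may differ in newer versions).
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

category = "writing"
temperature = temperature_config.get(category, 0.7)
do_sample = temperature > 1e-4  # nonzero temperature => sampled, non-deterministic answers
```

So even with everything else fixed, the writing and roleplay answers are sampled and will differ between generation runs.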

How are scores submitted to the leaderboard, or by others?

  1. Is the benchmark run only once?
  2. Is it run multiple times with the best score submitted?
  3. Is it run multiple times with the average score submitted?
bidyapati-p commented 6 months ago

Followed this document: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
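
For context, the workflow from that README looks roughly like this (model paths and flags are illustrative; see the linked document for the current options):

```bash
# Step 1: generate answers with the model under test (temperature varies by category).
python gen_model_answer.py --model-path meta-llama/Llama-2-70b-chat-hf --model-id llama2-70b-chat

# Step 2: grade the answers with the GPT-4 judge (this is the step we reran 5 times).
python gen_judgment.py --model-list llama2-70b-chat

# Step 3: aggregate the judgments into the final MT-bench score.
python show_result.py --model-list llama2-70b-chat
```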

infwinston commented 6 months ago

Unfortunately, the output of the GPT-4 API can vary even with temperature=0. We recommend running the judging multiple times and taking the median.
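
For example, a minimal way to aggregate repeated judging runs (scores below are hypothetical):

```python
import statistics

# Hypothetical final MT-bench scores from 5 independent GPT-4 judging runs
# over the same fixed set of generated answers.
run_scores = [6.86, 6.92, 7.02, 6.95, 6.89]

print(f"median: {statistics.median(run_scores):.2f}")  # robust to a single outlier run
print(f"mean:   {statistics.mean(run_scores):.2f}")
print(f"spread: {max(run_scores) - min(run_scores):.2f}")
```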

bidyapati-p commented 6 months ago

Yes, we verified the judge model's default parameters (GPT-4): temperature 0, max_tokens 2048. The GPT-4 score still changes by 1 point for many individual records, and the variation in the final score is about 0.16 across 5 runs.
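
As a rough sanity check on that magnitude (assuming single-answer grading over MT-bench's 80 questions with 2 turns each, i.e. 160 judgments per run):

```python
# Back-of-envelope: how many individual judgments would need to shift by 1 point
# (net, ignoring cancellations) to move the averaged score by ~0.16.
num_judgments = 80 * 2      # 80 questions, 2 turns each (single-answer grading)
observed_shift = 0.16       # spread we observed across 5 scoring runs
points_per_flip = 1.0       # a single record's score changing by 1

flips_needed = observed_shift * num_judgments / points_per_flip
print(f"~{flips_needed:.0f} of {num_judgments} judgments changing by 1 point "
      f"shifts the average by {observed_shift}")
# => roughly 26 judgments, consistent with "score varies by 1 for many records"
```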