lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Variation in MT-bench score #3018

Open bidyapati-p opened 6 months ago

bidyapati-p commented 6 months ago

We ran MT-bench multiple times with llama2-70b-chat. With the generated answers held fixed (step 1, run once), the GPT-4 scoring (step 2, run multiple times) still varies; in our experiment the final score varied by 0.16 over 5 runs.

If we also include the answer generation step, the variation might be even larger. There are two sources of variation:

  1. Answer generation uses temperature 0.7 for the writing and roleplay categories, which can lead to variation in the text generated by the model under test (see the sketch after this list).
  2. Scoring by GPT-4 varies every time we rerun it, even on the same generated answers.
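
For reference, the per-category generation temperatures look roughly like the sketch below, based on the `temperature_config` dictionary in `fastchat/llm_judge/common.py` at the time of writing (check the repo for the authoritative values):

```python
# Sketch of per-category sampling temperatures used when generating model answers
# (assumed from fastchat/llm_judge/common.py; values may differ in newer versions).
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

category = "writing"
temperature = temperature_config.get(category, 0.7)
do_sample = temperature > 1e-4  # nonzero temperature => sampled, non-deterministic answers
```

So even with everything else fixed, the writing and roleplay answers are sampled and will differ between generation runs.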

How are scores submitted to the leaderboard, or by others?

  1. Is the benchmark run only once?
  2. Is it run multiple times with the best score submitted?
  3. Is it run multiple times with the average score submitted?
bidyapati-p commented 6 months ago

Followed this document: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
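
For context, the workflow from that README looks roughly like this (model paths and flags are illustrative; see the linked document for the current options):

```bash
# Step 1: generate answers with the model under test (temperature varies by category).
python gen_model_answer.py --model-path meta-llama/Llama-2-70b-chat-hf --model-id llama2-70b-chat

# Step 2: grade the answers with the GPT-4 judge (this is the step we reran 5 times).
python gen_judgment.py --model-list llama2-70b-chat

# Step 3: aggregate the judgments into the final MT-bench score.
python show_result.py --model-list llama2-70b-chat
```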

infwinston commented 6 months ago

Unfortunately, the output of the GPT-4 API can vary even with temperature=0. We recommend running the judging multiple times and taking the median.
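
For example, a minimal way to aggregate repeated judging runs (scores below are hypothetical):

```python
import statistics

# Hypothetical final MT-bench scores from 5 independent GPT-4 judging runs
# over the same fixed set of generated answers.
run_scores = [6.86, 6.92, 7.02, 6.95, 6.89]

print(f"median: {statistics.median(run_scores):.2f}")  # robust to a single outlier run
print(f"mean:   {statistics.mean(run_scores):.2f}")
print(f"spread: {max(run_scores) - min(run_scores):.2f}")
```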

bidyapati-p commented 6 months ago

Yes, we verified the judge model's default parameters (GPT-4): temperature 0, max_tokens 2048. The GPT-4 score still changes by 1 point for many individual records, and the variation in the final score is about 0.16 across 5 runs.
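
As a rough sanity check on that magnitude (assuming single-answer grading over MT-bench's 80 questions with 2 turns each, i.e. 160 judgments per run):

```python
# Back-of-envelope: how many individual judgments would need to shift by 1 point
# (net, ignoring cancellations) to move the averaged score by ~0.16.
num_judgments = 80 * 2      # 80 questions, 2 turns each (single-answer grading)
observed_shift = 0.16       # spread we observed across 5 scoring runs
points_per_flip = 1.0       # a single record's score changing by 1

flips_needed = observed_shift * num_judgments / points_per_flip
print(f"~{flips_needed:.0f} of {num_judgments} judgments changing by 1 point "
      f"shifts the average by {observed_shift}")
# => roughly 26 judgments, consistent with "score varies by 1 for many records"
```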