bidyapati-p opened 6 months ago
Followed this document: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
Unfortunately, the output of the GPT-4 API can vary even with temperature=0. We recommend running it multiple times and taking the median.
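The "run multiple times, take the median" advice can be sketched as below. `judge_answers` is a hypothetical stand-in for a single GPT-4 judging run (FastChat's `gen_judgment` step); it is not the real API call, just a placeholder that mimics small run-to-run variation.

```python
import random
import statistics

def judge_answers(seed: int) -> float:
    # Placeholder for one GPT-4 judging run over a fixed set of model
    # answers. Real scores drift slightly between runs even at
    # temperature=0, which the random jitter here stands in for.
    random.seed(seed)
    return 7.0 + random.uniform(-0.1, 0.1)

# Judge the same answers several times, then report the median score.
scores = [judge_answers(seed) for seed in range(5)]
final_score = statistics.median(scores)
```

Taking the median (rather than the mean) makes the reported score robust to a single outlier judging run.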
Yes, we verified the judge model's default parameters (GPT-4): temperature 0, max_tokens 2048. The GPT-4 score varies by 1 for many individual records, and the variation in the final score was about 0.16 across 5 runs.
We ran MT-bench multiple times with llama2-70b-chat. With the generated text fixed (step 1, run once), the GPT-4 scoring (step 2, run multiple times) still varies. In our experiment it varied by 0.16 over 5 runs.
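The run-to-run spread described above can be computed like this. The five final scores below are illustrative placeholders (not our actual measurements), chosen only so the max-min spread comes out to 0.16:

```python
import statistics

# Hypothetical final MT-bench scores from 5 judging runs over the same
# llama2-70b-chat answers (step 1 run once, step 2 repeated 5 times).
runs = [6.78, 6.86, 6.94, 6.82, 6.90]

spread = max(runs) - min(runs)      # run-to-run variation in the final score
median = statistics.median(runs)    # the score one would report
```

Here `spread` is 0.16, matching the variation we observed, and `median` is the value we would report.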
If we include the text generation step as well, it might vary even more. So there are two sources of variation: answer generation (step 1) and GPT-4 judging (step 2).
How is the score computed for leaderboard submissions, or by others who report results?