lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

The accuracy issue of MT bench #3386

Open Luoqiu76 opened 3 months ago

Luoqiu76 commented 3 months ago

I ran MT-bench on llama-2-chat with the latest code and got a score of only about 5.86, whereas the officially reported score is around 6.3. For my own model, judging the same responses twice with GPT-4 gave scores that, surprisingly, differed by about 0.2 on average. In addition, the issue reported in #2659 does not appear to have been resolved yet, and I am not sure whether it is the cause of this discrepancy.
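To make the comparison concrete, here is a minimal sketch of how the gap between two GPT-4 judging runs over the same responses could be measured. It assumes each run is a JSONL judgment file whose records carry `model`, `question_id`, `turn`, and `score` fields, and that a score of `-1` marks a judgment whose rating could not be parsed. The helper names (`load_judgments`, `mean_score`, `run_gap`) are hypothetical and are not part of FastChat.

```python
import json
from statistics import mean


def load_judgments(path):
    """Read a JSONL judgment file into a list of dicts (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def _valid(records, model):
    """Keep records for one model, dropping scores of -1 (assumed parse failures)."""
    return [r for r in records if r["model"] == model and r["score"] != -1]


def mean_score(records, model):
    """Average score over all valid (question, turn) judgments for a model."""
    return mean(r["score"] for r in _valid(records, model))


def run_gap(run_a, run_b, model):
    """Compare two judging runs on the (question_id, turn) pairs they share.

    Returns (mean signed difference, mean absolute difference), where each
    difference is the run_a score minus the run_b score.
    """
    a = {(r["question_id"], r["turn"]): r["score"] for r in _valid(run_a, model)}
    b = {(r["question_id"], r["turn"]): r["score"] for r in _valid(run_b, model)}
    shared = sorted(a.keys() & b.keys())
    diffs = [a[k] - b[k] for k in shared]
    return mean(diffs), mean(abs(d) for d in diffs)
```

Reporting both numbers helps separate a systematic shift (signed difference far from zero) from plain judge noise (signed difference near zero but a large absolute difference). Note that `run_gap` only looks at pairs that were scored successfully in both runs, so its result can differ from the gap between the two overall averages.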