I used the latest code to test the MT-Bench score of llama-2-chat, and my result was only about 5.86, while the officially reported score is around 6.3. For my own model, judging the same responses twice, the average difference between the two GPT-4 scores was about 0.2. Additionally, the issue in #2659 does not seem to be resolved yet, and I am not sure whether it is the cause of this discrepancy.