lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Question about Llama-3.1-405b-instruct's results #43

Closed by snova-bol 1 month ago

snova-bol commented 1 month ago

Hi, thanks for the great work! Would you mind sharing how you collected the 405b-instruct number? We measured it locally but got 62-63%, not the 69% on the leaderboard. Did you get that number with a specific system prompt or with style control?

Thanks a lot!

CodingWithTim commented 1 month ago

Hi! We use the same endpoint as Chatbot Arena: llama-3.1-405b-instruct-fp8 from Anyscale. Additionally, we add the following system prompt when using Llama-3.1-405b-instruct on both Chatbot Arena and Arena-Hard-Auto: Cutting Knowledge Date: December 2023\nToday Date: 31 Aug 2024. You can add a system instruction when generating model answers by adding system_prompt to the model's entry in api_config.yaml:

gpt-3.5-turbo-0125:
    model_name: gpt-3.5-turbo-0125
    endpoints: null
    api_type: openai
    parallel: 8
    system_prompt: [insert system instruction]
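For example, an entry for llama-3.1-405b-instruct-fp8 might look like the following sketch; everything other than system_prompt simply mirrors the example above, so adjust the endpoint settings for your own provider:

llama-3.1-405b-instruct-fp8:
    model_name: llama-3.1-405b-instruct-fp8
    endpoints: null  # replace with your provider's endpoint configuration if needed
    api_type: openai
    parallel: 8
    system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 31 Aug 2024"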

Notably, when we tested what happens without this system prompt, we observed a degradation in performance:

llama-3.1-405b-instruct-fp8               | score: 69.3  | 95% CI: (-2.2, 2.7)  | average #tokens: 658                      
llama-3.1-405b-instruct-fp8-no-sys-prompt | score: 64.2  | 95% CI: (-2.2, 2.4)  | average #tokens: 635                     

The leaderboard presented in the README.md is not style-controlled.

snova-bol commented 1 month ago

Thanks @CodingWithTim! One more question: is there documentation explaining how you get the ~60% win rate from the GPT-4 judge? Is it computed by sampling 100 problems and taking the win rate, or with a more sophisticated formula? Thanks a lot again!

CodingWithTim commented 1 month ago

No problem! The 69.3% is the win-rate against gpt-4-0314 (the default baseline), which is produced by running python show_result.py.

We did not subsample. We used the same code that is in the repo, so if you are running python show_result.py, you are using the same setup as us.

The code in show_result.py first computes the Bradley-Terry coefficients for each model, and then recomputes the win-rate against the gpt-4-0314 baseline. Since every model is only compared against a single baseline, the win-rate is invariant to the number of models. Further, we count a significant win (e.g. A>>B) as 3 wins, a significant loss as 3 losses, a small win (e.g. A>B) as 1 win, a small loss as 1 loss, and a tie (e.g. A=B) as a single tie. The win-rate is then computed as (total wins + 0.5 * total ties) / (total wins + total ties + total losses).
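In other words, the final step is just the following (a minimal sketch of the formula above, not code from the repo):

# A significant verdict (A>>B or B>>A) adds 3 to the relevant win/loss count,
# a small verdict (A>B or B>A) adds 1, and A=B adds 1 to the tie count.
def win_rate(wins, ties, losses):
    return (wins + 0.5 * ties) / (wins + ties + losses)

# For example, 300 weighted wins, 100 ties, and 100 weighted losses
# give (300 + 0.5 * 100) / 500 = 0.70.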

I just pushed the llama-3.1-405b-instruct-fp8 generation and judgment files to the repo, feel free to check them out. If you compute the win-rate against gpt-4-0314 (the default baseline), you should get 69.3%. I reproduced this number on my end, so feel free to try it on yours.

CodingWithTim commented 1 month ago

Also, here is code that calculates just the win-rate (skipping the Bradley-Terry step) from any judgment file:

import pandas as pd

# judgment_file: path to the judgment .jsonl for the model you want to score
judgment = pd.read_json(judgment_file, lines=True)

# Game 1: the model's answer is shown as assistant B, the baseline's as assistant A.
# Significant verdicts (A>>B / B>>A) count 3x, small verdicts count 1x, A=B is a tie.
win_map_1 = {"B>A": ["model"],
             "B>>A": ["model"] * 3,
             "A>B": ["baseline"],
             "A>>B": ["baseline"] * 3,
             "A=B": ["tie"]}
# Game 2: positions are swapped, so the model's answer is assistant A.
win_map_2 = {"B>A": ["baseline"],
             "B>>A": ["baseline"] * 3,
             "A>B": ["model"],
             "A>>B": ["model"] * 3,
             "A=B": ["tie"]}

# Map each game's verdict to its weighted outcome labels and flatten into one series.
outcomes = pd.concat([judgment.games.map(lambda x: win_map_1[x[0]["score"]]).explode(),
                      judgment.games.map(lambda x: win_map_2[x[1]["score"]]).explode()])

outcomes.value_counts()
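To turn those counts into the actual win-rate, something like the following should match the formula above (a sketch; the labels come from the maps in the snippet):

counts = outcomes.value_counts()
win_rate = (counts.get("model", 0) + 0.5 * counts.get("tie", 0)) / counts.sum()
print(win_rate)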

If you try this on the judgment file I pushed to Hugging Face, you should also get 69.3%. Note that this does not use the Bradley-Terry coefficients at all, which shows that the Bradley-Terry step does not change the raw win-rate against the baseline.