Hi! We use the same endpoints as Chatbot Arena: llama-3.1-405b-instruct-fp8 from Anyscale. We also add the following system prompt when using Llama-3.1-405b-instruct on both Chatbot Arena and Arena-Hard-Auto:
Cutting Knowledge Date: December 2023\nToday Date: 31 Aug 2024
You can add a system instruction when generating model answers by adding system_prompt to api_config.yaml:
gpt-3.5-turbo-0125:
    model_name: gpt-3.5-turbo-0125
    endpoints: null
    api_type: openai
    parallel: 8
    system_prompt: [insert system instruction]
Notably, when we tested what happens without this system prompt, we observed a degradation in performance:
llama-3.1-405b-instruct-fp8 | score: 69.3 | 95% CI: (-2.2, 2.7) | average #tokens: 658
llama-3.1-405b-instruct-fp8-no-sys-prompt | score: 64.2 | 95% CI: (-2.2, 2.4) | average #tokens: 635
The leaderboard presented in the README.md is not style-controlled.
Thanks @CodingWithTim! One more question: is there documentation explaining how you get the ~60% win-rate from the GPT-4 judges? Is that done by sampling 100 problems and then computing the win-rate, or with some more sophisticated equation? Thanks a lot again!
No problem! The number 69.3% is the win-rate against gpt-4-0314 (the default baseline), which is produced by running python show_result.py.
We did not subsample. We used the same code that is on the repo, so if you are running python show_result.py, then we are using the same setup.
The code in show_result.py first computes the Bradley-Terry coefficients for each model and then recomputes the win-rate against the gpt-4-0314 baseline. Since every model is only compared against a single baseline, the win-rate is invariant to the number of models. Further, we count a significant win (e.g. A>>B) as 3 wins, a significant loss as 3 losses, a small win (e.g. A>B) as 1 win, a small loss as 1 loss, and a tie (e.g. A=B) as a single tie. Thus the win-rate is computed as (total wins + 0.5 * total ties) / (total wins + ties + losses).
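To illustrate why the win-rate is invariant to the number of models: once you have Bradley-Terry coefficients, the win-rate against the single baseline only involves those two coefficients. Here is a minimal sketch of that textbook Bradley-Terry relationship, not the actual code in show_result.py, with made-up coefficient values:
import math

def bt_win_rate(coef_model, coef_baseline):
    # Bradley-Terry: P(model beats baseline) = exp(b_m) / (exp(b_m) + exp(b_b))
    #                                        = sigmoid(b_m - b_b).
    # Only the two coefficients involved matter, so adding or removing other
    # models from the comparison pool does not change this probability.
    return 1.0 / (1.0 + math.exp(coef_baseline - coef_model))

print(bt_win_rate(coef_model=0.81, coef_baseline=0.0))  # hypothetical values, ~0.69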
I just pushed the llama-3.1-405b-instruct-fp8 generation and judgment files to the repo, feel free to check them out. If you compute the win-rate against gpt-4-0314 (the default baseline), you should get 69.3%. I reproduced this number on my end; feel free to try it on yours.
Also, here is the code to calculate only the win-rate given any judgment file:
import pandas as pd

# judgment_file: path to an Arena-Hard-Auto judgment .jsonl file
judgment = pd.read_json(judgment_file, lines=True)

# Game 1: the evaluated model plays position B, so "B>A" is a model win;
# a significant verdict (e.g. B>>A) counts as 3 wins.
win_map_1 = {"B>A": ["model"],
             "B>>A": ["model"] * 3,
             "A>B": ["baseline"],
             "A>>B": ["baseline"] * 3,
             "A=B": ["tie"]}

# Game 2: positions are swapped, so the mapping is inverted.
win_map_2 = {"B>A": ["baseline"],
             "B>>A": ["baseline"] * 3,
             "A>B": ["model"],
             "A>>B": ["model"] * 3,
             "A=B": ["tie"]}

# Expand each judged game into its list of wins/losses/ties and count them.
outcomes = pd.concat([judgment.games.map(lambda x: win_map_1[x[0]["score"]]).explode(),
                      judgment.games.map(lambda x: win_map_2[x[1]["score"]]).explode()])
outcomes.value_counts()
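To go from those counts to the headline number, apply the weighted formula described above. The few lines below are a rough continuation of the snippet (not part of show_result.py) and assume the "model" / "baseline" / "tie" labels defined in the maps:
counts = outcomes.value_counts()
wins = counts.get("model", 0)       # weighted wins for the evaluated model
losses = counts.get("baseline", 0)  # weighted wins for the gpt-4-0314 baseline
ties = counts.get("tie", 0)
win_rate = (wins + 0.5 * ties) / (wins + losses + ties)
print(f"win-rate vs gpt-4-0314: {win_rate:.1%}")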
If you try this on the judgment file I pushed to Hugging Face, you should also get 69.3%. Note that this does not use the Bradley-Terry coefficients at all, which shows the Bradley-Terry step leaves the raw win-rate against the baseline unchanged.
Hi, thanks for the great work! Would you mind sharing how you collected the 405b-instruct number? We measured it locally but got 62-63%, not the 69% on the leaderboard. Did you get that number with a specific system prompt or style control?
Thanks a lot!