lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Only support baseline=True and pairwise=True? #6

Closed GradientGuru closed 2 months ago

GradientGuru commented 2 months ago

The prompt template on GitHub compares two models instead of scoring a single answer.

CodingWithTim commented 2 months ago

Hi! If you haven't already, please check out the evaluation process detailed in the blog post.

Unlike MT Bench, Arena Hard v0.1 uses an enhanced pairwise comparison method to evaluate models. We found this method works better than single-score judging. However, baseline and pairwise don't have to be set to true.
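Roughly, pairwise judging means each judgment is a preference label rather than a numeric score. Here is a tiny sketch of how such labels could be turned into a win rate against the baseline; the label set and the half credit for ties are illustrative assumptions, not the repo's exact scheme:

```python
# Sketch: converting pairwise verdict labels into a win rate against the baseline.
# The labels and tie weighting below are assumptions for illustration only.
def win_rate_vs_baseline(verdicts: list[str]) -> float:
    """verdicts holds labels like 'A>B' or 'B>A', where B is the model under test."""
    if not verdicts:
        return 0.0
    score = 0.0
    for v in verdicts:
        if v in ("B>>A", "B>A"):   # model preferred over the baseline
            score += 1.0
        elif v in ("A=B", "B=A"):  # tie gets half credit
            score += 0.5
    return score / len(verdicts)
```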

If you set baseline=False and remove \n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|> from the judge_config's prompt_template, then your judge will only look at one answer. That way you can easily judge MT Bench as well, once you change the system prompt and the regex pattern to work with MT Bench's single-score judgment.
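For example, a minimal single-answer setup could look like the dict below. The key names follow the spirit of judge_config.yaml, but the Assistant-A-only template and the MT-Bench-style "Rating: [[7]]" regex are assumptions for illustration, not the repo's exact values:

```python
# Minimal sketch of a single-answer judge config (assumed key names and values).
single_answer_judge_config = {
    "baseline": False,  # no baseline answer is inserted into the prompt
    "prompt_template": (
        "<|User Prompt|>\n{question_1}\n\n"
        "<|The Start of Assistant A's Answer|>\n{answer_1}\n"
        "<|The End of Assistant A's Answer|>"
    ),
    # Single-score extraction instead of the pairwise verdict pattern.
    "regex_pattern": r"\[\[(\d+\.?\d*)\]\]",
}
```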

If you set pairwise=False, then each prompt is only evaluated once. Due to positional bias and variance, we recommend evaluating each prompt twice, once with the baseline answer positioned as the first answer and once with the baseline as the second answer. But you can set pairwise=False to save cost when evaluating many checkpoints.
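To make the two-game scheme concrete, here is a small sketch; render_pairwise_prompt is a hypothetical helper standing in for the template fill, not a function from this repo:

```python
# Sketch of the two-game evaluation described above. render_pairwise_prompt is a
# hypothetical callable that fills the pairwise prompt template with a question
# and the two answers in the given order.
def build_judge_prompts(question, baseline_answer, model_answer, render_pairwise_prompt):
    # Game 1: baseline answer shown first (as Assistant A).
    game_1 = render_pairwise_prompt(question, baseline_answer, model_answer)
    # Game 2: positions swapped, baseline shown second (as Assistant B).
    game_2 = render_pairwise_prompt(question, model_answer, baseline_answer)
    # With pairwise=False only game_1 is run; judging both orders and combining
    # the verdicts averages out positional bias at roughly double the cost.
    return [game_1, game_2]
```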

Hope this is helpful!