lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Only support baseline=True and pairwise=True? #6

Closed GradientGuru closed 2 months ago

GradientGuru commented 2 months ago

The prompt template on GitHub compares two models instead of scoring a single answer.

CodingWithTim commented 2 months ago

Hi! If you haven't already, please check out the evaluation process detailed in the blog post.

Unlike MT Bench, Arena Hard v0.1 uses an enhanced pairwise comparison method to evaluate models. We found this method works better than single-score judging. However, baseline and pairwise don't have to be set to true.
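Roughly, pairwise judging means each judgment is a preference label rather than a numeric score. Here is a tiny sketch of how such labels could be turned into a win rate against the baseline; the label set and the half credit for ties are illustrative assumptions, not the repo's exact scheme:

```python
# Sketch: converting pairwise verdict labels into a win rate against the baseline.
# The labels and tie weighting below are assumptions for illustration only.
def win_rate_vs_baseline(verdicts: list[str]) -> float:
    """verdicts holds labels like 'A>B' or 'B>A', where B is the model under test."""
    if not verdicts:
        return 0.0
    score = 0.0
    for v in verdicts:
        if v in ("B>>A", "B>A"):   # model preferred over the baseline
            score += 1.0
        elif v in ("A=B", "B=A"):  # tie gets half credit
            score += 0.5
    return score / len(verdicts)
```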

If you set baseline=False and remove \n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|> from the judge_config's prompt_template, then your judge will only look at one answer. That way you can easily judge MT Bench as well, once you change the system prompt and the regex pattern to work with MT Bench's single-score judgment.
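For example, a minimal single-answer setup could look like the dict below. The key names follow the spirit of judge_config.yaml, but the Assistant-A-only template and the MT-Bench-style "Rating: [[7]]" regex are assumptions for illustration, not the repo's exact values:

```python
# Minimal sketch of a single-answer judge config (assumed key names and values).
single_answer_judge_config = {
    "baseline": False,  # no baseline answer is inserted into the prompt
    "prompt_template": (
        "<|User Prompt|>\n{question_1}\n\n"
        "<|The Start of Assistant A's Answer|>\n{answer_1}\n"
        "<|The End of Assistant A's Answer|>"
    ),
    # Single-score extraction instead of the pairwise verdict pattern.
    "regex_pattern": r"\[\[(\d+\.?\d*)\]\]",
}
```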

If you set pairwise=False, then each prompt is only evaluated once. Due to positional bias and variance, we recommend evaluating each prompt twice, once with the baseline answer positioned as the first answer and once with the baseline as the second answer. But you can set pairwise=False to save cost when evaluating many checkpoints.
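To make the two-game scheme concrete, here is a small sketch; render_pairwise_prompt is a hypothetical helper standing in for the template fill, not a function from this repo:

```python
# Sketch of the two-game evaluation described above. render_pairwise_prompt is a
# hypothetical callable that fills the pairwise prompt template with a question
# and the two answers in the given order.
def build_judge_prompts(question, baseline_answer, model_answer, render_pairwise_prompt):
    # Game 1: baseline answer shown first (as Assistant A).
    game_1 = render_pairwise_prompt(question, baseline_answer, model_answer)
    # Game 2: positions swapped, baseline shown second (as Assistant B).
    game_2 = render_pairwise_prompt(question, model_answer, baseline_answer)
    # With pairwise=False only game_1 is run; judging both orders and combining
    # the verdicts averages out positional bias at roughly double the cost.
    return [game_1, game_2]
```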

Hope this is helpful!