Hi! If you haven't already, please check out the evaluation process detailed in the blog post.
Unlike MT Bench, Arena Hard v0.1 uses an enhanced pairwise comparison method to evaluate models. We found this method to work better than single-score judging. However, `baseline` and `pairwise` don't have to be set to true.
If you set `baseline=False` and remove `\n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|>` from the `judge_config`'s `prompt_template`, then your judge will only look at one answer. This way you can easily judge MT Bench as well, once you change the system prompt and the regex pattern to work with MT Bench's single-score judgment.
If you set `pairwise=False`, then you will only evaluate each prompt once. Due to positional bias and variance, we recommend evaluating each prompt twice, once with the baseline answer positioned as the first answer and once with the baseline as the second answer. But you can set `pairwise=False` to save cost when evaluating many checkpoints.
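For context, here is a rough illustration of the two-game setup; the function and argument names are made up for this sketch and are not the repo's actual code.

```python
# Sketch of why pairwise judging runs two "games" per prompt: judges tend to
# favor whichever answer appears first, so the baseline is shown first in one
# game and second in the other to average out positional bias.
def judge_prompt(question, baseline_answer, model_answer, judge_fn, pairwise=True):
    verdicts = []
    # Game 1: baseline answer as Assistant A, model answer as Assistant B.
    verdicts.append(judge_fn(question, answer_a=baseline_answer, answer_b=model_answer))
    if pairwise:
        # Game 2: positions swapped.
        verdicts.append(judge_fn(question, answer_a=model_answer, answer_b=baseline_answer))
    return verdicts  # pairwise=False keeps only the single, cheaper judgment
```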
Hope this is helpful!
The prompt template on GitHub compares two models instead of scoring a single answer.