lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Fix `winner` in `show_result.py` for game 2 #23

Closed · alvarobartt closed this 1 month ago

alvarobartt commented 1 month ago

Description

This PR fixes a bug in the `winner` field for game 2: when the verdict is A>B or A>>B, the winner is set to `model_b` when it should be `model_a`, and the same applies the other way around.

cc @CodingWithTim

CodingWithTim commented 1 month ago

Hi, sorry about the confusion. We run two games of judgment for each answer against the baseline, switching the position of the answers presented to the judge to mitigate positional bias. That is why, in the second game, A>>B or A>B means `model_b` wins. The code is already implemented correctly.
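
For illustration, here is a minimal sketch of the verdict-to-winner mapping described above. The helper name `winner_from_verdict` is hypothetical and this is not the actual `show_result.py` code; it only mirrors the swapped-position logic explained in this comment.

```python
def winner_from_verdict(verdict: str, game: int) -> str:
    """Map a judge verdict to the winning model for a given game.

    In game 2 the answer positions shown to the judge are swapped to
    mitigate positional bias, so an "A" verdict maps to model_b rather
    than model_a (and vice versa).
    """
    a_wins = verdict in ("A>B", "A>>B")
    b_wins = verdict in ("B>A", "B>>A")
    if not (a_wins or b_wins):
        return "tie"
    if game == 1:
        return "model_a" if a_wins else "model_b"
    # Game 2: positions are swapped, so the labels invert.
    return "model_b" if a_wins else "model_a"


# In game 2, "A>>B" or "A>B" means model_b wins, as explained above.
assert winner_from_verdict("A>>B", game=2) == "model_b"
```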

alvarobartt commented 1 month ago

> Hi, sorry about the confusion. We run two games of judgment for each answer against the baseline, switching the position of the answers presented to the judge to mitigate positional bias. That is why, in the second game, A>>B or A>B means `model_b` wins. The code is already implemented correctly.

Oh, sorry for the oversight then!