lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Fix `winner` in `show_result.py` for game 2 #23

Closed alvarobartt closed 6 months ago

alvarobartt commented 6 months ago

Description

This PR fixes a bug in the `winner` field for game 2: when the verdict is either A>B or A>>B, the winner is set to `model_b` when it should be `model_a`; the same applies the other way around.

cc @CodingWithTim

CodingWithTim commented 6 months ago

Hi, sorry about the confusion. We run 2 games of judgment for each answer against the baseline, switching the position of the answers presented to the judge to mitigate positional bias. That is why in the second game, A>>B or A>B means model_b wins. The code is already implemented correctly.
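For clarity, the mapping described above can be sketched as follows. The function name, signature, and verdict labels are assumptions for illustration only, not the actual `show_result.py` implementation:

```python
def get_winner(verdict: str, game: int) -> str:
    """Map a judge verdict to "model_a", "model_b", or "tie".

    Game 1 presents the answers in their original order; game 2 swaps
    them before judging to mitigate positional bias, so an "A" win in
    game 2 counts for model_b.
    """
    if verdict in ("A=B", "B=A"):
        return "tie"
    a_wins = verdict in ("A>B", "A>>B")
    if game == 1:
        return "model_a" if a_wins else "model_b"
    # Game 2: positions were swapped before judging, so invert the mapping.
    return "model_b" if a_wins else "model_a"
```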

alvarobartt commented 5 months ago

> Hi, sorry about the confusion. We run 2 games of judgment for each answer against the baseline, switching the position of the answers presented to the judge to mitigate positional bias. That is why in the second game, A>>B or A>B means model_b wins. The code is already implemented correctly.

Oh, sorry for the oversight then!