lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Bug in get_battles_from_judgment #14

Closed tangbinh closed 6 months ago

tangbinh commented 6 months ago

It looks like the winners are not correctly specified in the get_battles_from_judgment function. Can you take a look and fix it?

https://github.com/lm-sys/arena-hard/blob/5eb649883765107f42b171962683829fd064a63f/show_result.py#L160-L168

CodingWithTim commented 6 months ago

Hi! The winner is specified correctly. We run two games of judgment per prompt: in the first game, the baseline model's answer is placed first; in the second game, it is placed after the model's answer. So in the second game, "A>B" means the model answer is preferred over the baseline. Thanks!
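The position-swap logic described above can be sketched as follows. This is a hypothetical illustration, not the actual code in show_result.py: the function name `battle_winner` and the record format are assumptions for clarity.

```python
# Hypothetical sketch of the two-game winner mapping (not the real
# show_result.py implementation).

def battle_winner(verdict: str, game_index: int) -> str:
    """Map a judge verdict ("A>B", "B>A", or "A=B") to a winner label.

    In game 1 the baseline model's answer is in position A; in game 2
    the order is swapped, so the model answer is in position A.
    """
    if verdict == "A=B":
        return "tie"
    baseline_is_a = (game_index == 1)  # baseline occupies slot A only in game 1
    a_wins = (verdict == "A>B")
    # Slot A's occupant wins iff the verdict favors A.
    return "baseline" if a_wins == baseline_is_a else "model"
```

For example, `battle_winner("A>B", 2)` returns `"model"`, matching the explanation that "A>B" in the second game means the model answer beat the baseline.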