lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Bug in get_battles_from_judgment #14

Closed tangbinh closed 2 months ago

tangbinh commented 2 months ago

It looks like the winners are not correctly specified in the get_battles_from_judgment function. Can you take a look and fix it?

https://github.com/lm-sys/arena-hard/blob/5eb649883765107f42b171962683829fd064a63f/show_result.py#L160-L168

CodingWithTim commented 2 months ago

Hi! The winner is specified correctly. We run 2 games of judgment per prompt: in the first game the baseline model is placed first (position A), and in the second game the order is swapped, so the model answer is placed first. Hence in the second game, "A>B" means the model answer is preferred over the baseline. Thanks!
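To illustrate the position swap described above, here is a minimal sketch (not the repo's actual code; the function name, label strings, and game indexing are assumptions for illustration) of how a judge label maps to a winner once you account for which side each model occupies:

```python
def winner_from_judgment(game_idx, label, model, baseline="baseline"):
    """Map a judge label to the winning model, accounting for the swap.

    Hypothetical helper: in game 0 the baseline sits in position A;
    in game 1 the positions are swapped, so the model answer is A.
    """
    if label == "A=B":
        return "tie"
    # Determine which model occupies each position in this game.
    a, b = (baseline, model) if game_idx == 0 else (model, baseline)
    # Strong-preference labels (">>") resolve to the same winner as ">".
    if label in ("A>B", "A>>B"):
        return a
    if label in ("B>A", "B>>A"):
        return b
    raise ValueError(f"unknown label: {label}")
```

With this mapping, the same label "A>B" names the baseline as winner in game 0 but the model answer as winner in game 1, which is the behavior the issue asked about.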