lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Bradley-Terry model #22

Closed dmitrysarov closed 5 months ago

dmitrysarov commented 5 months ago

First of all, thanks for your work. Maybe I have misunderstood, but I could not find any usage or implementation of the Bradley-Terry model in your code; instead, you seem to be doing something interesting with logistic regression coefficients. Could you point me to the source of the idea behind this? And do you think the Bradley-Terry model would perform worse than this LR trick?

CodingWithTim commented 5 months ago

Hi! The code you pointed to is in fact Bradley-Terry, fitted by reweighted maximum likelihood estimation. It is essentially the same code used to compute the rankings for Chatbot Arena. Full details of the math can be found in the math section of our Chatbot Arena paper: https://arxiv.org/pdf/2403.04132
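
To make the connection concrete, here is a minimal sketch (not the repository's actual code) of why a logistic regression coefficient is a Bradley-Terry strength. Under Bradley-Terry, P(a beats b) = sigmoid(beta_a - beta_b). If each battle is encoded as a row with +1 in model a's column and -1 in model b's column, then a logistic regression with no intercept has exactly that likelihood, so its coefficients are the maximum likelihood Bradley-Terry log-strengths. The function names, the toy battle data, and the plain gradient-ascent fit below are all assumptions made for illustration; reweighting and any rescaling of the scores are left out.

```python
import math


def fit_bradley_terry(battles, n_iter=2000, lr=0.5):
    """Fit Bradley-Terry log-strengths as a no-intercept logistic regression.

    battles: list of (model_a, model_b, winner), winner being "a" or "b".
    Hypothetical helper for illustration; ties are not handled.
    """
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    idx = {m: i for i, m in enumerate(models)}

    # Design matrix: +1 in model_a's column, -1 in model_b's column.
    rows, labels = [], []
    for a, b, winner in battles:
        row = [0.0] * len(models)
        row[idx[a]] = 1.0
        row[idx[b]] = -1.0
        rows.append(row)
        labels.append(1.0 if winner == "a" else 0.0)

    beta = [0.0] * len(models)
    n = len(rows)
    for _ in range(n_iter):
        grad = [0.0] * len(models)
        for row, y in zip(rows, labels):
            z = sum(r * b for r, b in zip(row, beta))
            p = 1.0 / (1.0 + math.exp(-z))  # P(model_a wins) under the model
            for j, r in enumerate(row):
                if r:
                    grad[j] += (y - p) * r
        # Gradient ascent on the mean log-likelihood.
        beta = [b + lr * g / n for b, g in zip(beta, grad)]

    # Strengths are identified only up to an additive constant, so center them.
    mean = sum(beta) / len(beta)
    return {m: beta[idx[m]] - mean for m in models}


def win_probability(ratings, a, b):
    """Bradley-Terry probability that a beats b."""
    return 1.0 / (1.0 + math.exp(-(ratings[a] - ratings[b])))


# Made-up toy data: A beats B 7/10, B beats C 6/10, A beats C 8/10.
toy_battles = (
    [("A", "B", "a")] * 7 + [("A", "B", "b")] * 3
    + [("B", "C", "a")] * 6 + [("B", "C", "b")] * 4
    + [("A", "C", "a")] * 8 + [("A", "C", "b")] * 2
)
toy_ratings = fit_bradley_terry(toy_battles)
print(sorted(toy_ratings, key=toy_ratings.get, reverse=True))
```

With an off-the-shelf logistic regression in place of the hand-written loop, the fitted coefficients play the same role as `beta` here, which is why the ranking code can look like an "LR trick" while still being Bradley-Terry.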