lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

How does calibrating for an anchor model in `elo_analysis` affect the Elo ratings and bootstrap CIs? #3377

Open acylam opened 3 months ago

acylam commented 3 months ago

Hi, I've been searching for an explanation of why the Elo scores in `elo_analysis` are scaled relative to one of the models when computing the MLE Elo ratings, but couldn't find one. Specifically, I see this line added at the end of `compute_elo_mle_with_tie`:

if "mixtral-8x7b-instruct-v0.1" in models.index:
    elo_scores += 1114 - elo_scores[models["mixtral-8x7b-instruct-v0.1"]]
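For context, my understanding is that Bradley-Terry/Elo ratings fitted by MLE are only identifiable up to an additive constant: shifting every rating by the same amount leaves all pairwise win probabilities unchanged, so some anchor is needed to pin the scale down. A minimal sketch (the ratings and the shift constant here are made up for illustration):

```python
import numpy as np

def win_prob(r_a, r_b, scale=400):
    # Standard Elo / Bradley-Terry win probability for model A vs model B
    return 1 / (1 + 10 ** ((r_b - r_a) / scale))

r = np.array([1000.0, 1050.0, 1114.0])
shifted = r + 123.0  # any constant shift

# Pairwise win probabilities are identical under the shift, so the MLE
# only determines rating *differences*; anchoring one model to a fixed
# number (e.g. 1114) fixes the remaining free constant.
assert np.allclose(win_prob(r[0], r[1]), win_prob(shifted[0], shifted[1]))
```

This explains why *some* anchor is required, though not the particular choice of model or the value 1114.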

This effectively anchors the scores to mixtral-8x7b-instruct-v0.1 with a rating of 1114. Is there any explanation for the choice of this model and the seemingly arbitrary number 1114? How does adding this affect the overall Elo ratings and the corresponding bootstrap CIs? Is it chosen so that Elo ratings can be compared over time? I'm asking because I noticed that the bootstrap CIs become much wider with the anchor model, and I'm not sure which model and number to choose for my own set of models.

Any help will be appreciated, thanks!