Open endolith opened 8 months ago
We publish Arena battle data with timestamps here: https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=o_CpbkGEbhrK
You can use a sliding window to plot a model's rating over time. Could you contribute a PR?
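The sliding-window idea above can be sketched roughly like this: recompute a simple online Elo fit on just the battles inside a rolling time window, giving one point per window on a rating-vs-time curve. The row layout (`tstamp`, `model_a`, `model_b`, `winner`) is an assumption about the Arena data's schema, and the online-Elo update is a stand-in for whatever fit the real notebook uses:

```python
# Sketch: sliding-window rating-over-time (schema and K-factor are assumptions).
from datetime import datetime, timedelta

def elo_from_battles(battles, k=4, base=1000.0):
    """One online-Elo pass over (model_a, model_b, winner) rows."""
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))          # expected score for a
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb + k * ((1 - sa) - (1 - ea))
    return ratings

def sliding_window_ratings(rows, window=timedelta(days=7), step=timedelta(days=1)):
    """rows: list of (tstamp, model_a, model_b, winner), sorted by tstamp."""
    t, end, out = rows[0][0], rows[-1][0], []
    while t <= end:
        in_window = [(a, b, w) for ts, a, b, w in rows if t <= ts < t + window]
        if in_window:
            out.append((t, elo_from_battles(in_window)))
        t += step
    return out
```

Each window is refit from scratch, so a model's curve can jump between windows; that discontinuity is part of what makes WHR (below) attractive by comparison.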
So I tried this over the weekend and made some graphs, but the library doesn't handle ties, doesn't seem to report uncertainty correctly, and I don't know what the w² value should be set to:
w is a parameter of the model, that indicates the variability of ratings in time. The extreme case of w = 0 would mean static ratings.
The Wiener process had a variance of w² = 60 Elo² per day.
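To make that quoted figure concrete: under a Wiener-process (random-walk) prior, the variance of a rating's drift grows linearly with elapsed time, so the prior standard deviation of the drift over t days is √(w²·t). A tiny sketch of what w² = 60 Elo²/day implies:

```python
# Sketch: prior rating drift under a Wiener-process prior with variance
# w^2 per day. w^2 = 60 Elo^2/day is the value quoted from the WHR paper.
import math

def prior_drift_sd(w2_elo2_per_day, days):
    """Prior standard deviation of rating drift after `days` days."""
    return math.sqrt(w2_elo2_per_day * days)

# With w^2 = 60: ~7.7 Elo of expected drift sd after 1 day,
# ~42 Elo after 30 days; w^2 = 0 forces a static rating.
```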
Is there any reason why a pre-trained LLM would change skill over time? The only thing I can think of is network outages or the like that cause people to vote down that model temporarily (which I am guilty of). But otherwise the model weights and inference implementation are always the same, right? ("In the context of LLM evaluation, models can be assumed to be static.") So I wonder if ideally there would be a way to mark certain models as having w = 0, and others (API calls, models with internet access, etc.) as having skill that could plausibly change?
(Also, I arbitrarily added 1000 to the results to make them look more like Elo ratings, but I don't know if that really makes them equivalent to Elo ratings.)
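For what it's worth, a constant offset is harmless in the usual Elo formulation: the win expectancy depends only on the rating *difference*, so adding 1000 everywhere changes the numbers but none of the predictions. Whether the underlying scale matches Elo's logistic-with-400 convention is a separate question. A minimal check:

```python
# Sketch: Elo win expectancy is invariant under a constant rating offset.
def expected_score(ra, rb):
    """Probability that a player rated ra beats one rated rb (Elo logistic)."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

# expected_score(100, 0) and expected_score(1100, 1000) are identical,
# since only the difference (100) enters the formula.
```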
(I also want to try counting "both are bad" as a loss of both models against a "HumanEvaluator" model that would serve as a sort of benchmark for ideal LLM performance, but I need to figure out how to implement ties first.)
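The transformation I have in mind for "both are bad" would look something like this: expand each such vote into two losses against a fixed pseudo-model. The "HumanEvaluator" name and the row format here are my own hypothetical choices, and how plain ties get scored still depends on the rating library:

```python
# Sketch (hypothetical): expand "both are bad" votes into two losses
# against a pseudo-model acting as an ideal-performance benchmark.
def expand_battles(rows, benchmark="HumanEvaluator"):
    """rows: list of (model_a, model_b, verdict) tuples."""
    out = []
    for a, b, verdict in rows:
        if verdict == "both_bad":
            out.append((benchmark, a, "win"))   # benchmark beats model a
            out.append((benchmark, b, "win"))   # benchmark beats model b
        else:
            out.append((a, b, verdict))         # win / loss / tie passed through
    return out
```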
For the API-based models, there are frequent claims online that users see models getting worse over time. It would be good to know if that's true. Copying a comment of mine from HF:
I know there are a bunch of Elo variants, but I never learned the exact differences. Here is one summary:
I know Glicko has a measure of uncertainty built in, but I'm not sure how it compares to lmsys's bootstrap method.
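My understanding of the bootstrap approach, sketched generically: resample the battle list with replacement, refit the ratings each time, and take percentiles of the resulting distribution per model. `rate()` here is a stand-in for whatever rating fit is actually used:

```python
# Sketch: bootstrap confidence intervals for ratings. `rate` is any
# function mapping a battle list to {model: rating}; this is a generic
# illustration, not lmsys's actual implementation.
import random

def bootstrap_ci(battles, rate, n_boot=200, lo=2.5, hi=97.5, seed=0):
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_boot):
        resampled = [rng.choice(battles) for _ in battles]
        for model, r in rate(resampled).items():
            samples.setdefault(model, []).append(r)
    out = {}
    for model, rs in samples.items():
        rs.sort()
        out[model] = (rs[int(len(rs) * lo / 100)], rs[int(len(rs) * hi / 100)])
    return out
```

Unlike Glicko's per-player deviation, this treats the whole fit as a black box, which is convenient but much more expensive (one refit per resample).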
Maybe WHR (Whole-History Rating) would be a better choice? I know WHR is used to track rock climbers' skill over time, for instance. From their paper:
WHR can show how models change in skill over time, and how confident we can be in the measurement: