lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0
316 stars 29 forks

How to add new models to the leaderboard? #25

Open chujiezheng opened 1 month ago

chujiezheng commented 1 month ago

Thanks for your great work. Can I request evaluation of new models to add to the leaderboard?

CodingWithTim commented 1 month ago

Hi @chujiezheng, I am a fan of your work! We would love to add new models. Could you give us more information on the models you want to add? Currently we just maintain a very lightweight leaderboard in the README.

chujiezheng commented 1 month ago

@CodingWithTim Thanks for your kind words! I have some HF models that I want to add:

They are ranked based on my educated guess of their performance. These models are obtained via our recently proposed ExPO (model extrapolation) method. You can find more ExPO-enhanced models in this 🤗 HuggingFace collection and see their performance on the AlpacaEval 2.0 leaderboard.
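For readers unfamiliar with ExPO: the core idea is to linearly extrapolate from the SFT checkpoint's weights past the aligned checkpoint's weights. A minimal sketch, assuming the extrapolation form θ = θ_aligned + α(θ_aligned − θ_SFT) described in the ExPO paper (the function name, dict-of-arrays weight representation, and α value here are illustrative, not from this thread):

```python
import numpy as np

def expo_extrapolate(theta_sft, theta_aligned, alpha=0.3):
    """Extrapolate past the aligned model's weights, layer by layer:
    theta = theta_aligned + alpha * (theta_aligned - theta_sft).
    alpha controls how far past the aligned checkpoint we move."""
    return {
        name: theta_aligned[name] + alpha * (theta_aligned[name] - theta_sft[name])
        for name in theta_aligned
    }

# Toy example with a single "layer" of weights.
sft = {"w": np.array([1.0, 2.0])}
aligned = {"w": np.array([1.5, 2.5])}
expo = expo_extrapolate(sft, aligned, alpha=0.5)
# expo["w"] is aligned + 0.5 * (aligned - sft) = [1.75, 2.75]
```

In practice the same per-parameter update would be applied over a full model `state_dict` rather than a toy dict of arrays.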

Due to API and GPU limits, I have so far only run the evaluation for Starling-LM-7B-beta-ExPO, which obtains a score of 24.9 with a 95% CI of (-2.2, 1.8). I attach the evaluation output files here. I would appreciate it if you could add Starling-LM-7B-beta-ExPO to the leaderboard, and I would also greatly appreciate your help evaluating the other models above and adding them as well.
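For anyone wanting to reproduce such an evaluation locally, the general flow follows the repo's generate-judge-aggregate pipeline. A rough sketch (the script and config file names below are assumptions based on the Arena-Hard-Auto README; check the current repo layout before running):

```shell
# 1. Register the new model endpoint, e.g. in config/api_config.yaml
#    (entry name and fields are illustrative):
#      starling-lm-7b-beta-expo:
#        model_name: chujiezheng/Starling-LM-7B-beta-ExPO

# 2. Generate the model's answers to the Arena-Hard prompts.
python gen_answer.py

# 3. Judge the answers pairwise against the baseline model.
python gen_judgment.py

# 4. Aggregate judgments into a score with bootstrapped 95% CIs,
#    i.e. numbers of the form 24.9 (-2.2, 1.8) as reported above.
python show_result.py
```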

BTW, since many research works have built their evaluations on Arena-Hard, do you have plans to build a leaderboard website like AlpacaEval's?