lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Inquiry about the process for submitting our model for inclusion on the leaderboard. #40

Open PKU-Baichuan opened 3 months ago

PKU-Baichuan commented 3 months ago

Can you add our new model, Llama3-PBM-Nova-70B, to the leaderboard?

Llama3-PBM-Nova-70B was developed with carefully designed SFT and RLHF techniques, building on the Meta-Llama-3-70B base model. Evaluation results on open-source benchmarks are provided below:

| Model | Arena-Hard | MixEval-Hard | AlpacaEval 2.0 |
|---|---|---|---|
| GPT-4-Turbo (04/09) | 82.6% | 62.6 | 55.0% |
| GPT-4o (05/13) | 79.2% | 64.7 | 57.5% |
| Gemini 1.5 Pro | 72.0% | 58.3 | - |
| Llama3-PBM-Nova-70B | 74.5% | 58.1 | 61.23% |
| Llama-3.1-70B-Instruct | 55.7% | - | 38.1% |
| Llama-3-70B-Instruct | 46.6% | 55.9 | 34.4% |

Compared with the current state-of-the-art models, Llama3-PBM-Nova-70B achieves top-tier performance among open-source models and rivals, or even surpasses, some closed-source models.
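In case it helps with verification, here is a minimal sketch of how we understand a model entry would be registered for evaluation with this repo, assuming the `config/api_config.yaml` format described in the README; the model key, endpoint URL, API key, and parallelism below are placeholders, not our actual setup:

```yaml
# Hypothetical entry in config/api_config.yaml (format taken from the README).
# The endpoint URL and API key are placeholders for an OpenAI-compatible
# server (e.g. a local vLLM deployment) hosting the model.
llama3-pbm-nova-70b:
    model_name: Llama3-PBM-Nova-70B
    endpoints:
        - api_base: http://localhost:8000/v1
          api_key: placeholder-key
    api_type: openai
    parallel: 8
```

If we read the README correctly, once the model is listed there (and in the answer/judge configs), answers and judgments are generated with the repo's `gen_answer.py` and `gen_judgment.py` scripts, and `show_result.py` prints the Arena-Hard score.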

Our detailed results are attached here.