lmarena arena-hard-auto issues

lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Apache License 2.0

656 stars, 74 forks

issues

generate answers for FT models

#53 baishalichaudhury closed 1 week ago
2
Can you provide a CSV file of the newest Chatbot Arena LLM Leaderboard (2024-11-04)?

#52 efsotr closed 1 week ago
0
Choosing weight and num_round setting for evaluation

#51 YJWon99 closed 1 week ago
1
About the style control leaderboard

#50 yangzy39 closed 2 weeks ago
3
The Replacement of Open-source JudgeModel/Evaluator

#49 bittersweet1999 opened 1 month ago
8
chore: update show_result.py

#48 eltociear opened 1 month ago
0
another hard prompt

#47 maninthemiddle01 closed 2 months ago
1
Improve reproducibility in utils_math.py

#46 dustalov opened 2 months ago
0
Can you release the model's answers and judgments for the models you ran on your benchmark?

#45 AsafYehudai closed 2 months ago
1
Add litellm, unified dataclass description, and compatibility with vision-language models

#44 BabyChouSr closed 1 month ago
2
Question about Llama-3.1-405b-instruct's results

#43 snova-bol closed 2 months ago
4
Add filter step in BenchBuilder

#42 BabyChouSr closed 2 months ago
1
Add support for vision-language conversations

#41 BabyChouSr closed 2 months ago
2
Inquire about the process for submitting our model to be included on the leaderboard.

#40 PKU-Baichuan opened 3 months ago
0
Conv should be defined within choice loop

#39 zankner closed 2 months ago
1
added merge leaderboard function

#38 connorchenn closed 3 months ago
1
added gpt 4o mini to leaderboard

#37 connorchenn closed 4 months ago
0
updated leaderboard

#36 connorchenn closed 4 months ago
1
remove leaderboard from root directory

#35 connorchenn closed 4 months ago
1
Fix typo in README

#34 PaperPlaneDeemo closed 4 months ago
1
new README

#33 connorchenn closed 4 months ago
5
edit README

#32 connorchenn closed 4 months ago
1
add export csv option in show_result.py

#31 connorchenn closed 4 months ago
2
Add TGI to readme

#30 karthik-nexusflow closed 5 months ago
1
Can you add deepseek-coder-v2?

#29 Kreijstal closed 5 months ago
1
Fix corner-case in token length calculation when the model generates tiktoken special tokens like `<|endoftext|>`

#28 sxjscience closed 5 months ago
2
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

#27 xiamengzhou closed 4 months ago
6
Is there any plan to share the full dataset (200k prompts) with the "number of hardness criteria met" label? I think it would be quite useful to the community

#26 alexchapeaux closed 5 months ago
1
How to add new models to the leaderboard?

#25 chujiezheng opened 5 months ago
2
Fix `winner` in `show_result.py` for game 2

#23 alvarobartt closed 5 months ago
2
Bradley-Terry model

#22 dmitrysarov closed 6 months ago
1
configurable parameters

#21 dmitrysarov closed 5 months ago
5
[Q] About hosting `arena-hard-v0.1/question.json` in the Hugging Face Hub

#19 alvarobartt closed 2 months ago
4
[Bug] Temperature is always `0.0`

#18 bcui19 closed 6 months ago
1
Multi-thread generation support?

#17 Ignoramus0817 closed 6 months ago
1
Discrepancy in Scores When Switching GPT Model Versions

#16 wlhgtc closed 6 months ago
10
[Discussion] Methodology for bootstrapping with replacement to obtain separability confidence intervals

#15 justinxzhao closed 6 months ago
2
Bug in get_battles_from_judgment

#14 tangbinh closed 6 months ago
1
[Feature] support arena-hard in opencompass

#13 bittersweet1999 closed 3 months ago
2
docs: add `git-lfs` note in `README.md`

#12 xukai92 closed 7 months ago
0
add missing deps for `show_result.py`

#11 xukai92 closed 7 months ago
1
Models testing themselves will always be biased.

#10 HideLord closed 7 months ago
1
Allow to set generation sampling parameters

#9 psinger closed 5 months ago
11
CI results different for same model answer copy

#8 qingquansong closed 7 months ago
2
Evaluate local models

#7 xiamengzhou closed 7 months ago
2
Only support baseline=True and pairwise=True?

#6 GradientGuru closed 7 months ago
1
Majority of questions are coding questions!

#5 nxphi47 closed 7 months ago
2
Markdown Rendering Issue

#4 suquark closed 7 months ago
1
Fix the order of questions.jsonl on Huggingface

#3 infwinston closed 7 months ago
0
QA browser does not work properly for me

#2 suquark closed 7 months ago
3