mlabonne / llm-autoeval

Automatically evaluate your LLMs in Google Colab
MIT License

Openllm fix #24

Closed mlabonne closed 3 months ago

mlabonne commented 3 months ago

It took me a while, but this PR fixes an old issue with the MMLU dataset. It also replaces vllm with accelerate to be more consistent with the Open LLM Leaderboard's results. For convenience, it doesn't use the exact same version of lm-evaluation-harness, but the results look very close. For instance:

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | Average |
|---|---|---|---|---|---|---|---|
| pythia-70m | 22.18 | 27.39 | 25.29 | 46.84 | 51.07 | 0.23 | 28.83 |

Compared with the scores reported on https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
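
For reference, a minimal sketch of the kind of invocation this change implies, assuming a recent lm-evaluation-harness CLI with the `hf` (accelerate-based) backend rather than the `vllm` one; the exact task names, shot counts, and flags used by this repo's scripts may differ:

```bash
# Hypothetical example: data-parallel evaluation via accelerate + the hf backend,
# as documented in lm-evaluation-harness. Task list mirrors the Open LLM
# Leaderboard suite; adjust --num_fewshot per task to reproduce it exactly.
accelerate launch -m lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-70m,dtype=float16 \
  --tasks arc_challenge,hellaswag,mmlu,truthfulqa,winogrande,gsm8k \
  --batch_size auto \
  --output_path ./results
```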

Known issue: the tables summarizing the results are poorly formatted. I don't think it's too important, so I'll hopefully fix it later; I've already made several inconclusive attempts.