It took me a while, but this fixes an old issue with the MMLU dataset. It also replaces vLLM with Accelerate to be more consistent with the Open LLM Leaderboard's results. For convenience, it doesn't use the same version of lm-evaluation-harness, but the results look very close. For instance:
Compared with https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Known issue: the tables summarizing the results are poorly formatted. I don't think it's too important, so I'll hopefully fix it later; I made several inconclusive attempts.
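For reference, a minimal sketch of the kind of invocation this change implies, using lm-evaluation-harness's `hf` backend launched through Accelerate instead of the `vllm` backend (the model name and batch size here are placeholders, not values from this PR):

```shell
# Evaluate MMLU with the Hugging Face backend via Accelerate
# (multi-GPU data parallelism), matching the Open LLM Leaderboard setup
# more closely than the vLLM backend.
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=mistralai/Mistral-7B-v0.1 \
    --tasks mmlu \
    --batch_size auto
```

The `--model hf` flag selects the transformers/Accelerate code path; swapping it for `--model vllm` is what this PR moves away from.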