There seems to be a discrepancy between the leaderboard and this repository, which may end up meaning that models were benchmarked with different settings than reported.
Specifically, `m_hellaswag` seems to specify 0 few-shot examples even though the leaderboard says 10:
https://github.com/laiviet/lm-evaluation-harness/blob/10cb5292748e882c22db7eed49a380089645c4c2/lm_eval/tasks/multilingual_hellaswag.py#L48-L56
Similarly, MMLU has 5 few-shot examples on the leaderboard but 25 in the code:
https://github.com/laiviet/lm-evaluation-harness/blob/10cb5292748e882c22db7eed49a380089645c4c2/lm_eval/tasks/multilingual_mmlu.py#L44-L48
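For what it's worth, one way to make the intended setting unambiguous would be to pass the few-shot count explicitly at evaluation time rather than relying on the task's default. The sketch below assumes this fork still exposes the upstream lm-evaluation-harness `simple_evaluate` entry point and that an explicit `num_fewshot` takes precedence over the value hard-coded in the task file; I haven't verified either for this repository, and the model choice is just for illustration.

```python
# Sketch only: pin the few-shot count explicitly instead of relying on the
# value hard-coded in the task file. Assumes the fork keeps the upstream
# lm-evaluation-harness API (lm_eval.evaluator.simple_evaluate with a
# num_fewshot argument) and that the explicit value overrides the task default.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",                  # hypothetical model, for illustration only
    model_args="pretrained=gpt2",
    tasks=["m_hellaswag"],         # the multilingual HellaSwag task discussed above
    num_fewshot=10,                # the few-shot count reported on the leaderboard
)
print(results["results"])
```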
Where should it be corrected - on the leaderboard or in the code? And what are the consequences for the models that you report?