laiviet / lm-evaluation-harness


NUM_FEW_SHOT does not correspond with the leaderboard #3

Open BramVanroy opened 9 months ago

BramVanroy commented 9 months ago

There seems to be a discrepancy between the leaderboard and this repository, which may mean that models were benchmarked with different settings than reported.

Specifically, m_hellaswag is configured to use 0 few-shot examples, even though the leaderboard reports 10.

https://github.com/laiviet/lm-evaluation-harness/blob/10cb5292748e882c22db7eed49a380089645c4c2/lm_eval/tasks/multilingual_hellaswag.py#L48-L56
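For context, the linked span presumably amounts to something like the sketch below. This is not the actual code: the class name, base class, and whether `NUM_FEW_SHOT` is a class attribute are all assumptions based on the upstream EleutherAI harness conventions; only the value 0 comes from this report.

```python
# Hypothetical sketch of the linked span in lm_eval/tasks/multilingual_hellaswag.py.
# Class name and base class are assumptions; only NUM_FEW_SHOT = 0 is taken
# from this issue report.
from lm_eval.base import MultipleChoiceTask

class MultilingualHellaSwag(MultipleChoiceTask):
    NUM_FEW_SHOT = 0  # the leaderboard reports 10-shot for m_hellaswag
```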

Similarly, the leaderboard lists MMLU with 5 few-shot examples, but the code uses 25.

https://github.com/laiviet/lm-evaluation-harness/blob/10cb5292748e882c22db7eed49a380089645c4c2/lm_eval/tasks/multilingual_mmlu.py#L44-L48
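A quick way to confirm both mismatches locally, assuming `NUM_FEW_SHOT` is reachable as an attribute of each task module (an assumption; adjust if it lives on the task class instead):

```python
# Print the few-shot counts baked into this fork's task files and compare
# them against the leaderboard's reported settings (10 and 5 respectively).
from lm_eval.tasks import multilingual_hellaswag, multilingual_mmlu

print(multilingual_hellaswag.NUM_FEW_SHOT)  # reportedly 0 in the code
print(multilingual_mmlu.NUM_FEW_SHOT)       # reportedly 25 in the code
```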

Where should this be corrected: on the leaderboard or in the code? And what are the consequences for the model results you report?

Taishi-N324 commented 7 months ago

Like @BramVanroy, I find that the number of few-shot examples doesn't match, and I can't reproduce the leaderboard results.
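One way to sidestep the ambiguity while this is unresolved is to pass the few-shot count explicitly instead of relying on the defaults in the task files. A minimal sketch, assuming this fork keeps the upstream `evaluator.simple_evaluate` API, the `hf-causal` model adapter, and the `m_hellaswag` task name (all assumptions):

```python
# Re-run m_hellaswag with the leaderboard-reported 10-shot setting,
# overriding whatever NUM_FEW_SHOT default the task file carries.
# The evaluator API, adapter, and task name are assumed to match upstream.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=gpt2",  # placeholder model for illustration
    tasks=["m_hellaswag"],
    num_fewshot=10,                # value the leaderboard reports
)
print(results["results"])
```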