@NathanHB wdyt of having 6 tasks called `leaderboard|task|...`? That way we could differentiate the modifications we make for more general setups from the pinned leaderboard versions.
Oh, good idea, though I'm not sure anyone will use the other versions if we have the leaderboard versions. We want to be able to compare results with as many models as possible.
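For illustration, a minimal sketch of what those six pinned specifiers could look like under the `suite|task|num_fewshot|truncate` layout. The task names and few-shot counts mirror the Open LLM Leaderboard setup, but the exact strings are assumptions, not lighteval's actual task registry:

```python
# Hypothetical pinned leaderboard task specifiers, one per Open LLM
# Leaderboard task, using the suite|task|num_fewshot|truncate layout
# discussed above. Exact names and few-shot counts are illustrative.
LEADERBOARD_TASKS = [
    "leaderboard|arc:challenge|25|0",
    "leaderboard|hellaswag|10|0",
    "leaderboard|mmlu|5|0",
    "leaderboard|truthfulqa:mc|0|0",
    "leaderboard|winogrande|5|0",
    "leaderboard|gsm8k|5|0",
]
```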
Well, at the moment, the leaderboard versions use a pinned, very old version of the harness, which led to the problems mentioned by @lewtun (for EOS tokens, for example). I think we should both address these problems and provide a cool version of our evals, but also allow people to reproduce leaderboard scores, wdyt?
I agree!
Just so I understand: in the new format, there is `leaderboard|tasks|num_fewshot|0`, but will `lighteval` still be a valid suite for e.g. `gsm8k`? In other words, the `leaderboard` suite => same logic as the old pinned version of the harness, but `lighteval` will have various improvements etc.?
@lewtun You understood perfectly! `leaderboard|task` should allow you to reproduce the current scores of the Open LLM Leaderboard. `lighteval|task` will follow our own logic for the task, in terms of EOS tokens and generation length.
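A minimal sketch of how such a specifier splits into its fields and how the same task would appear under both suites; `TaskSpec` and `parse_task_spec` are hypothetical helpers for illustration, not lighteval's actual API:

```python
from typing import NamedTuple

class TaskSpec(NamedTuple):
    suite: str        # "leaderboard" (pinned harness behavior) or "lighteval" (updated logic)
    task: str         # e.g. "gsm8k"
    num_fewshot: int  # number of few-shot examples in the prompt
    truncate: int     # whether few-shot examples may be truncated (0/1)

def parse_task_spec(spec: str) -> TaskSpec:
    """Split a 'suite|task|num_fewshot|truncate' string into its fields.

    Hypothetical helper for illustration; lighteval's real parsing
    may differ.
    """
    suite, task, num_fewshot, truncate = spec.split("|")
    return TaskSpec(suite, task, int(num_fewshot), int(truncate))

# The same task under both suites: identical prompt setup, but the
# "leaderboard" suite reproduces Open LLM Leaderboard scores, while
# "lighteval" applies the newer EOS-token and generation-length logic.
print(parse_task_spec("leaderboard|gsm8k|5|0"))
print(parse_task_spec("lighteval|gsm8k|5|0"))
```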
Will likely require updating the test suite; pending until we figure out the datasets bug in the CI. Linked to #82