huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Change the eos condition for GSM8K #85

Closed · clefourrier closed this 4 months ago

clefourrier commented 4 months ago

Will likely require updating the test suite; pending until we figure out the datasets bug in the CI. Linked to #82

clefourrier commented 4 months ago

@NathanHB wdyt of having 6 tasks called `leaderboard|task|...`? That way we could differentiate the modifications we make for more general setups from the pinned leaderboard versions.

NathanHB commented 4 months ago

> @NathanHB wdyt of having 6 tasks called `leaderboard|task|...`? That way we could differentiate the modifications we make for more general setups from the pinned leaderboard versions.

Oh good idea, though I'm not sure anyone will use the other versions if we have the leaderboard versions. We want to be able to compare results with as many models as possible.

clefourrier commented 4 months ago

Well, atm, the leaderboard versions use a pinned, very old version of the harness, which led to the problems mentioned by @lewtun (e.g. for EOS tokens). I think we should both address these problems and provide a cool version of our evals, but also allow people to reproduce leaderboard scores, wdyt?

NathanHB commented 4 months ago

I agree!

lewtun commented 4 months ago

> Well, atm, the leaderboard versions use a pinned, very old version of the harness, which led to the problems mentioned by @lewtun (e.g. for EOS tokens). I think we should both address these problems and provide a cool version of our evals, but also allow people to reproduce leaderboard scores, wdyt?

Just so I understand: in the new format, there is `leaderboard|tasks|num_fewshot|0`, but will `lighteval` still be a valid suite for e.g. `gsm8k`?

In other words, the `leaderboard` suite => same logic as the old pinned version of the harness, but `lighteval` will have various improvements etc.?

clefourrier commented 4 months ago

@lewtun You understood perfectly! `leaderboard|task` should allow you to reproduce the current scores of the Open LLM Leaderboard. `lighteval|task` will follow our own logic for the task, in terms of EOS tokens and generation length.
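
For concreteness, a sketch of what the two GSM8K task strings could look like under this scheme, following the `suite|task|num_fewshot|truncate` pattern from the comments above (the 5-shot setting is an assumption, not confirmed in this thread):

```text
# Hypothetical GSM8K task strings under the two suites.

# Pinned to the old harness logic; reproduces Open LLM Leaderboard scores.
leaderboard|gsm8k|5|0

# Follows lighteval's own logic for EOS tokens and generation length.
lighteval|gsm8k|5|0
```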