huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data-processing library datatrove and the LLM training library nanotron.

Add single `mmlu` config for `lighteval` suite #61

Open lewtun opened 4 months ago

lewtun commented 4 months ago

Currently it seems that to run MMLU with the lighteval suite, one needs to specify all the subsets individually, as is done for the leaderboard task set here.

Is it possible to group these together so that one can just run something like this:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|mmlu|5|0" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" \
    --output_dir "./scratch/evals/" --override_batch_size 1

Or do you recommend using one of the other suites like helm or original for this task?

clefourrier commented 4 months ago

At the moment it's not possible; however, if you run a task with many subsets (using a config file), the score table should display the average at the task level.
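In the meantime, a workaround along these lines may work. This is only a sketch and not verified against the current CLI: it assumes --tasks also accepts a path to a text file with one task per line, and the lighteval|mmlu:<subset> task names shown are illustrative and should be checked against the registered task list:

# mmlu_tasks.txt — one line per MMLU subset, 5-shot (subset names are examples)
lighteval|mmlu:abstract_algebra|5|0
lighteval|mmlu:anatomy|5|0
lighteval|mmlu:astronomy|5|0
# ... remaining subsets ...

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="./mmlu_tasks.txt" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" \
    --output_dir "./scratch/evals/" --override_batch_size 1

If that works as expected, the score table should show one row per subset plus the task-level average mentioned above.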

If you want to get results comparable to the Open LLM Leaderboard, you'll need to use the lighteval suite (you can take a look at the differences between the three versions here).