huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Anomalously high scores on GPQA #68

Closed (lewtun closed this issue 4 months ago)

lewtun commented 4 months ago

Edit: after posting this, I realised that 25% accuracy is the same as random chance, so we should expect most small models to be around this range.

When running a small Qwen model through GPQA, I am getting anomalously large scores compared to much larger models like Llama-70b-chat in the paper. For reference, here are the values from the paper (see also this blog post for some other model comparisons):

[Screenshot: GPQA accuracy table from the paper]

Now, when I run both 0-shot and 5-shot evals via:

# 0-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --tasks="lighteval|gpqa|0|0" --output_dir "./scratch/evals" --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" --override_batch_size 1

# 5-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --tasks="lighteval|gpqa|5|0" --output_dir "./scratch/evals" --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" --override_batch_size 1

I get:

| Task | Version | Metric | Value | Stderr |
|------------------|---------|--------|--------|----------|
| lighteval:gpqa:0 | 0 | acc | 0.2567 | ± 0.0207 |
| lighteval:gpqa:5 | 0 | acc | 0.2679 | ± 0.0209 |

These values are anomalously large for such a small model and I wonder if there's some issue in how we aggregate results?

A related question: do we report the average accuracy across the extended / main / diamond sets, or something else? I noticed that the Hub dataset provides 4 configs, but the task table just specifies the train split, which I suspect loads everything (including the expert annotations).

[Screenshot: the four GPQA configs listed on the Hub dataset page]
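For reference, here is a quick way to double-check which configs the Hub repo exposes and what the task actually loads (a minimal sketch using the datasets library directly rather than lighteval; the dataset is gated, so an authenticated login may be needed):

```python
from datasets import get_dataset_config_names, load_dataset

# List the subsets the Hub repo exposes (expecting the extended / main / diamond /
# experts configs). The dataset is gated, so `huggingface-cli login` may be required.
configs = get_dataset_config_names("Idavidrein/gpqa")
print(configs)

# Peek at the config the task table points to, using its only available split.
ds = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
print(len(ds), ds.column_names)
```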

I will also check Mixtral to see if the scores are much different.

clefourrier commented 4 months ago

The task is defined in tasks_table.jsonl by the following line:

{"name":"gpqa","suite":["lighteval"],"prompt_function":"gpqa","hf_repo":"Idavidrein/gpqa","hf_subset":"gpqa_main","hf_avail_splits":["train"],"evaluation_splits":["train"],"few_shots_split":null,"few_shots_select":"random_sampling","generation_size":1,"metric":["loglikelihood_acc_single_token"],"stop_sequence":["\n"],"output_regex":null,"frozen":false}

We select gpqa_main as the subset and use loglikelihood_acc_single_token as the metric, which means the aggregation is an average and the eval compares the logprobs of the different choices (A/B/C/D/...).
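To make concrete what that metric does, here is a rough sketch (illustrative only, not the actual lighteval implementation; prompt formatting and tokenization details are simplified):

```python
# Rough sketch of single-token loglikelihood accuracy: for each question we score only
# the next-token logprob of each choice letter after the prompt, pick the argmax, and
# average correctness over the dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B-Chat"  # the model from the report above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def predict_choice(prompt: str, choices=("A", "B", "C", "D")) -> str:
    """Pick the choice letter whose first token has the highest next-token logprob."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the token following the prompt
    logprobs = torch.log_softmax(logits, dim=-1)
    # Whether the letter is scored with or without a leading space is a tokenization
    # detail handled by lighteval; here we just use " A", " B", ... for illustration.
    ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[0] for c in choices]
    best = max(range(len(choices)), key=lambda i: logprobs[ids[i]].item())
    return choices[best]

# Accuracy is then the mean of (predict_choice(prompt) == gold_letter) over the split.
```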

clefourrier commented 4 months ago

Side note: 0.25 is random performance, so not that high ^^

lewtun commented 4 months ago

Here are the Mixtral scores, which are roughly consistent with those from the Reka blog post:

| Task | Version | Metric | Value | Stderr |
|------------------|---------|--------|--------|----------|
| lighteval:gpqa:0 | 0 | acc | 0.2455 | ± 0.0204 |
| lighteval:gpqa:5 | 0 | acc | 0.2545 | ± 0.0206 |

It is quite surprising that the model performs no better than random chance :)
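For what it's worth, a quick back-of-the-envelope check (my own arithmetic, assuming 4 answer options so chance accuracy is 0.25) shows both runs are well within one standard error of chance:

```python
# How many standard errors away from 4-way chance (0.25) are the Mixtral scores above?
chance = 0.25
for label, acc, stderr in [("0-shot", 0.2455, 0.0204), ("5-shot", 0.2545, 0.0206)]:
    print(f"{label}: z = {(acc - chance) / stderr:+.2f}")
# 0-shot: z = -0.22, 5-shot: z = +0.22  -> indistinguishable from chance
```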

lewtun commented 4 months ago

Closing this, since it was just confusion on my side about the benchmark.