huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Anomalously high scores on GPQA #68

Closed (lewtun closed this issue 4 months ago)

lewtun commented 4 months ago

Edit: after posting this, I realised that 25% accuracy is the same as random chance, so we should expect most small models to be around this range.

When running a small Qwen model through GPQA, I am getting anomalously large scores compared to much larger models like Llama-70b-chat in the paper. For reference, here are the values from the paper (see also this blog post for some other model comparisons):

[Screenshot: GPQA accuracy table from the paper]

Now, when I run both 0-shot and 5-shot evals via:

# 0-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --tasks="lighteval|gpqa|0|0" --output_dir "./scratch/evals" --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" --override_batch_size 1

# 5-shot
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --tasks="lighteval|gpqa|5|0" --output_dir "./scratch/evals" --model_args "pretrained=Qwen/Qwen1.5-0.5B-Chat" --override_batch_size 1

I get:

| Task | Version | Metric | Value | Stderr |
|------------------|---------|--------|--------|----------|
| lighteval:gpqa:0 | 0 | acc | 0.2567 | ± 0.0207 |
| lighteval:gpqa:5 | 0 | acc | 0.2679 | ± 0.0209 |

These values are anomalously large for such a small model and I wonder if there's some issue in how we aggregate results?

A related question: do we report the average accuracy across the extended / main / diamond sets, or something else? I noticed that the Hub dataset provides 4 configs, but the task table just specifies the train split, which I suspect loads everything (including the expert annotations).

[Screenshot: the four GPQA configs listed on the Hub dataset page]
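For reference, here is a quick way to double-check which configs the Hub repo exposes and what the task actually loads (a minimal sketch using the datasets library directly rather than lighteval; the dataset is gated, so an authenticated login may be needed):

```python
from datasets import get_dataset_config_names, load_dataset

# List the subsets the Hub repo exposes (expecting the extended / main / diamond /
# experts configs). The dataset is gated, so `huggingface-cli login` may be required.
configs = get_dataset_config_names("Idavidrein/gpqa")
print(configs)

# Peek at the config the task table points to, using its only available split.
ds = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
print(len(ds), ds.column_names)
```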

I will also check Mixtral to see if the scores are much different.

clefourrier commented 4 months ago

The task is defined in tasks_table.jsonl by the following line:

{"name":"gpqa","suite":["lighteval"],"prompt_function":"gpqa","hf_repo":"Idavidrein/gpqa","hf_subset":"gpqa_main","hf_avail_splits":["train"],"evaluation_splits":["train"],"few_shots_split":null,"few_shots_select":"random_sampling","generation_size":1,"metric":["loglikelihood_acc_single_token"],"stop_sequence":["\n"],"output_regex":null,"frozen":false}

We select gpqa_main as the subset and use loglikelihood_acc_single_token as the metric, which means the aggregation is an average and the eval compares the logprobs of the different choices (A/B/C/D/...).
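To make concrete what that metric does, here is a rough sketch (illustrative only, not the actual lighteval implementation; prompt formatting and tokenization details are simplified):

```python
# Rough sketch of single-token loglikelihood accuracy: for each question we score only
# the next-token logprob of each choice letter after the prompt, pick the argmax, and
# average correctness over the dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B-Chat"  # the model from the report above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def predict_choice(prompt: str, choices=("A", "B", "C", "D")) -> str:
    """Pick the choice letter whose first token has the highest next-token logprob."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the token following the prompt
    logprobs = torch.log_softmax(logits, dim=-1)
    # Whether the letter is scored with or without a leading space is a tokenization
    # detail handled by lighteval; here we just use " A", " B", ... for illustration.
    ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[0] for c in choices]
    best = max(range(len(choices)), key=lambda i: logprobs[ids[i]].item())
    return choices[best]

# Accuracy is then the mean of (predict_choice(prompt) == gold_letter) over the split.
```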

clefourrier commented 4 months ago

Side note: 0.25 is random performance, so not that high ^^

lewtun commented 4 months ago

Here are the Mixtral scores, which are roughly consistent with those from the Reka blog post:

| Task | Version | Metric | Value | Stderr |
|------------------|---------|--------|--------|----------|
| lighteval:gpqa:0 | 0 | acc | 0.2455 | ± 0.0204 |
| lighteval:gpqa:5 | 0 | acc | 0.2545 | ± 0.0206 |

It is quite surprising that the model performs no better than random chance :)
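For what it's worth, a quick back-of-the-envelope check (my own arithmetic, assuming 4 answer options so chance accuracy is 0.25) shows both runs are well within one standard error of chance:

```python
# How many standard errors away from 4-way chance (0.25) are the Mixtral scores above?
chance = 0.25
for label, acc, stderr in [("0-shot", 0.2455, 0.0204), ("5-shot", 0.2545, 0.0206)]:
    print(f"{label}: z = {(acc - chance) / stderr:+.2f}")
# 0-shot: z = -0.22, 5-shot: z = +0.22  -> indistinguishable from chance
```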

lewtun commented 4 months ago

Closing this, since it was just confusion on my side about the benchmark.