The task is defined in `tasks_table.jsonl`, in the following line:
{"name":"gpqa","suite":["lighteval"],"prompt_function":"gpqa","hf_repo":"Idavidrein/gpqa","hf_subset":"gpqa_main","hf_avail_splits":["train"],"evaluation_splits":["train"],"few_shots_split":null,"few_shots_select":"random_sampling","generation_size":1,"metric":["loglikelihood_acc_single_token"],"stop_sequence":["\n"],"output_regex":null,"frozen":false}
We select `gpqa_main` as the subset and use `loglikelihood_acc_single_token` as the metric, which means that the aggregation is an average and the eval compares the logprobs of the different choices (A/B/C/D/...).
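In case it is useful, here is a rough sketch of what that single-token comparison amounts to, written against plain `transformers` rather than lighteval's actual implementation (the model name is just a placeholder, and it assumes each choice letter maps to a single token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B"  # placeholder; swap in whatever model is being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def predict_choice(prompt: str, choices=("A", "B", "C", "D")) -> str:
    """Pick the choice whose first token has the highest logprob after the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    next_token_logprobs = logits[0, -1].log_softmax(dim=-1)
    # Assumes " A", " B", ... each tokenize to exactly one token.
    choice_ids = [tokenizer(" " + c, add_special_tokens=False).input_ids[0] for c in choices]
    best = max(range(len(choices)), key=lambda i: next_token_logprobs[choice_ids[i]].item())
    return choices[best]
```

Accuracy is then just the fraction of questions where the predicted letter matches the gold letter, averaged over the evaluation split.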
Side note: 0.25 is random performance, so not that high ^^
Here are the Mixtral scores, which are roughly consistent with those from the Reka blog post:
| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| lighteval:gpqa:0 | 0 | acc | 0.2455 | ± | 0.0204 |
| lighteval:gpqa:5 | 0 | acc | 0.2545 | ± | 0.0206 |
It is quite surprising that the model performs no better than random chance :)
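As a quick sanity check (my own arithmetic, not lighteval output), both values are well within one standard error of the 0.25 baseline for guessing among four choices:

```python
# z-scores of the reported accuracies against the 0.25 random-guessing baseline
for acc, stderr in [(0.2455, 0.0204), (0.2545, 0.0206)]:
    z = (acc - 0.25) / stderr
    print(f"acc={acc:.4f}  z={z:+.2f}")
# Both |z| values are ~0.22, i.e. statistically indistinguishable from chance.
```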
Closing this since it was confusion on my side about the benchmark.
Edit: after posting this, I realised that 25% accuracy is the same as random chance, so we should expect most small models to be around this range.
When running a small Qwen model through GPQA, I am getting anomalously large scores compared to much larger models like Llama-70b-chat in the paper. For reference, here are the values from the paper (see also this blog post for some other model comparisons):
Now, when I run both 0-shot and 5-shot evals via:
I get:
These values are anomalously large for such a small model and I wonder if there's some issue in how we aggregate results?
A related question is whether we report the average accuracy across the extended / main / diamond sets or something else? I noticed in the Hub dataset that 4 configs are provided, but the task table just specifies the `train` split, which I suspect just loads everything (including the expert annotations). I will also check Mixtral to see if the scores are much different.
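To double-check what those 4 configs actually contain, something like this should work (just a sketch with the `datasets` API; it assumes the dataset's terms have been accepted on the Hub):

```python
from datasets import get_dataset_config_names, load_dataset

# List every GPQA config on the Hub and how many rows each one loads.
for name in get_dataset_config_names("Idavidrein/gpqa"):
    ds = load_dataset("Idavidrein/gpqa", name, split="train")
    print(name, len(ds))
```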