lewtun opened 4 months ago
Cool points! We could definitely have two versions: one multichoice, looking at logprobs (which is cool because it's very, very fast), and the other following the original implementation as closely as possible, therefore being generative, if I understood correctly.
You can add the second one under the `original` keyword if you want :smiley:
Regarding the few-shot CoT prompt, let's add it to #8 and do it in another PR - we'll notably need to change the format a bit if we want to allow passing fixed few-shot example files, for instance. Wdyt?
Yes, a generative version sounds great! I can start with the vanilla zero-shot and few-shot prompts, and we can add the CoT ones later as you suggest :)
Sounds good, I'll assign this to you then :)
GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:
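From memory, the zero-shot template looks roughly like this (a paraphrase of Appendix A.3.1, not a verbatim copy):

```
What is the correct answer to this question: {question}

Choices:
(A) {choice_1}
(B) {choice_2}
(C) {choice_3}
(D) {choice_4}

Format your response as follows: "The correct answer is (insert answer here)"
```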
In particular, note the final instruction to format the answer; they also mention using a regex parser to extract the desired answer:
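Something along these lines - a hypothetical sketch of the kind of regex parser the paper describes, not the exact pattern from their codebase:

```python
import re

# Hypothetical sketch; the exact pattern in the official GPQA codebase may differ.
# It pulls the choice letter out of completions ending in
# 'The correct answer is (X)'.
ANSWER_RE = re.compile(r"The correct answer is \(([A-D])\)", re.IGNORECASE)

def extract_answer(completion: str) -> str | None:
    """Return the choice letter from a model completion, or None if not found."""
    match = ANSWER_RE.search(completion)
    return match.group(1).upper() if match else None

assert extract_answer("...long explanation... The correct answer is (C).") == "C"
```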
However, inspecting the details from `lighteval`, I see we have the following for zero-shot:

The trouble with this format is that it heavily penalises chat models, which will typically produce a long-winded explanation and thus fail to produce the expected format (A, B, C, D) that a base model typically will.
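To make the failure mode concrete, here's an illustrative comparison (made-up completions, not real model outputs):

```python
# Illustrative completions only, not real model outputs.
base_completion = " C"
chat_completion = (
    "Let's reason through the options. (A) is ruled out because ... "
    "so the best choice is (C)."
)

def strict_first_letter(completion: str) -> str | None:
    """What a strict 'expect a bare letter' check effectively does."""
    letter = completion.strip()[:1].upper()
    return letter if letter in {"A", "B", "C", "D"} else None

print(strict_first_letter(base_completion))  # 'C'  -> scored correct
print(strict_first_letter(chat_completion))  # None -> scored wrong
```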
Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link), which can be adapted to pure few-shot by removing the reasoning steps (see the sketch below). However, it seems that `lighteval` samples few-shot prompts from the dataset, and I wonder if it makes sense to align the evaluation in both cases (zero-shot / few-shot) with the paper?

Happy to take a stab at this one if you agree!
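Here's a rough sketch of that adaptation - the field names are placeholders I made up, not the paper's actual schema:

```python
# Hypothetical few-shot example; the paper's actual CoT prompt file has
# its own schema, so treat these field names as placeholders.
example = {
    "question": "Which planet in the solar system is the largest?",
    "choices": {"A": "Earth", "B": "Jupiter", "C": "Mars", "D": "Venus"},
    "reasoning": "Jupiter's mass exceeds that of all the other planets combined.",
    "answer": "B",
}

def format_example(ex: dict, with_cot: bool) -> str:
    """Render one few-shot example, optionally dropping the reasoning steps."""
    lines = [f"Question: {ex['question']}"]
    lines += [f"({letter}) {text}" for letter, text in ex["choices"].items()]
    if with_cot:
        lines.append(ex["reasoning"])
    lines.append(f"The correct answer is ({ex['answer']})")
    return "\n".join(lines)

print(format_example(example, with_cot=False))
```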