huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Align GPQA zero-shot / few-shot prompts with paper? #70

Open lewtun opened 4 months ago

lewtun commented 4 months ago

GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:

What is the correct answer to this question: {QUESTION}
Choices:
(A) {CHOICE_A}
(B) {CHOICE_B}
(C) {CHOICE_C}
(D) {CHOICE_D}

Format your response as follows: "The correct answer is (insert answer here)".
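
For concreteness, here is a minimal sketch of that template as a Python helper (the function name and signature are mine, not from the paper's code):

```python
def gpqa_zeroshot_prompt(question: str, choices: list[str]) -> str:
    """Render the fixed zero-shot template from Appendix A.3.1 (hypothetical helper)."""
    lettered = "\n".join(f"({letter}) {choice}" for letter, choice in zip("ABCD", choices))
    return (
        f"What is the correct answer to this question: {question}\n"
        f"Choices:\n{lettered}\n\n"
        'Format your response as follows: "The correct answer is (insert answer here)".'
    )
```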

In particular, note the final instruction on how to format the answer. The paper also mentions that a regex parser is used to extract the desired answer:

We extracted answers from the model response using a simple regex matching phrases like ‘answer is’, ‘answer:’ etc.
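
A hedged sketch of such an extractor (the exact patterns in the paper's code may differ):

```python
import re

# Matches e.g. "The correct answer is (B)" or "answer: C" (hypothetical patterns).
ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\)?", re.IGNORECASE)

def extract_answer(response: str) -> str | None:
    """Return the first matched choice letter, or None if no phrase matches."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None

assert extract_answer("The correct answer is (B).") == "B"
```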

However, inspecting the prompt details in lighteval, I see we currently have the following for zero-shot:

Select the correct answer to the following questions.

Question: Identify the final product produced when cyclobutyl(cyclopropyl)methanol reacts with phosphoric acid in water.
A. spiro[3.4]oct-5-ene
B. 1,2-dimethylcyclohexa-1,4-diene
C. 1,2,3,4,5,6-hexahydropentalene
D. [1,1'-bi(cyclobutan)]-1-ene
Answer: 

The trouble with this format is that it heavily penalises chat models, which typically produce a long-winded explanation and thus fail to answer with the bare letter (A, B, C, D) that a base model typically emits.

Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link), which can be adapted to pure few-shot by removing the reasoning steps. However, it seems that lighteval samples few-shot prompts from the dataset, and I wonder if it makes sense to align the evaluation in both cases (zero-shot / few-shot) with the paper?
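
As an illustration, stripping the reasoning steps reduces the paper's few-shot prompt to something like the following sketch (reusing the hypothetical helper above; the per-example answer format is my assumption):

```python
def gpqa_fewshot_prompt(shots: list[dict], question: str, choices: list[str]) -> str:
    """Prepend fixed few-shot examples (CoT removed) to the zero-shot template."""
    rendered = [
        gpqa_zeroshot_prompt(shot["question"], shot["choices"])
        + f'\nThe correct answer is ({shot["answer"]})'
        for shot in shots
    ]
    rendered.append(gpqa_zeroshot_prompt(question, choices))
    return "\n\n".join(rendered)
```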

Happy to take a stab at this one if you agree!

clefourrier commented 4 months ago

Cool points! We could definitely have two versions: one multichoice version looking at logprobs (which is cool because it's very, very fast), and one following the original implementation as closely as possible, which would be generative if I understood correctly.
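
For context, the logprob-based multichoice version amounts to something like this sketch, where `loglikelihood` stands in for whichever backend scoring call is used (not lighteval's actual API):

```python
from typing import Callable

def multichoice_predict(
    loglikelihood: Callable[[str, str], float],  # hypothetical: logP(continuation | context)
    prompt: str,
    letters: tuple[str, ...] = ("A", "B", "C", "D"),
) -> str:
    """Pick the letter whose continuation the model scores highest (no generation needed)."""
    return max(letters, key=lambda letter: loglikelihood(prompt, f" {letter}"))
```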

You can add the second one under the original keyword if you want :smiley:

Regarding the few-shot CoT prompt, let's add it to #8 and do it in another PR - we'll notably need to change the format a bit if we want to allow passing fixed few-shot example files. Wdyt?

lewtun commented 4 months ago

Yes, a generative version sounds great! I can start with the vanilla zero-shot and few-shot prompts, and we can add the CoT ones later as you suggest :)

clefourrier commented 4 months ago

Sounds good, I'll assign this to you then :)