lewtun opened 4 months ago
Cool points! We could definitely have two versions: one multichoice, looking at logprobs (which is cool because it's very, very fast), and the other following the original implementation as closely as possible, therefore being generative, if I understood correctly.
You can add the second one under the `original` keyword if you want :smiley:
Regarding the few-shot CoT prompt, let's add it to #8 and do it in another PR - we'll notably need to change the format a bit if we want to allow passing fixed few-shot example files, for instance. Wdyt?
Yes, a generative version sounds great! I can start with the vanilla zero-shot and few-shot prompts, and we can add the CoT ones later as you suggest :)
Sounds good, I'll assign this to you then :)
GPQA uses a fixed prompt for zero-shot and few-shot evaluation (see Appendix A.3.1 of the paper). For example, this is the format of the zero-shot prompt:
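From memory, the zero-shot template looks roughly like this (a paraphrase of Appendix A.3.1, not a verbatim copy):

```
What is the correct answer to this question: {question}

Choices:
(A) {choice_1}
(B) {choice_2}
(C) {choice_3}
(D) {choice_4}

Format your response as follows: "The correct answer is (insert answer here)"
```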
In particular, note the final instruction to format the answer; they also mention using a regex parser to extract the desired answer:
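Something along these lines - a hypothetical sketch of the kind of regex parser the paper describes, not the exact pattern from their codebase:

```python
import re

# Hypothetical sketch; the exact pattern in the official GPQA codebase may differ.
# It pulls the choice letter out of completions ending in
# 'The correct answer is (X)'.
ANSWER_RE = re.compile(r"The correct answer is \(([A-D])\)", re.IGNORECASE)

def extract_answer(completion: str) -> str | None:
    """Return the choice letter from a model completion, or None if not found."""
    match = ANSWER_RE.search(completion)
    return match.group(1).upper() if match else None

assert extract_answer("...long explanation... The correct answer is (C).") == "C"
```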
However, inspecting the details from `lighteval`, I see we have the following for zero-shot:

The trouble with this format is that it heavily penalises chat models, which will typically produce a long-winded explanation and thus fail to produce the expected format (A, B, C, D) that a base model typically will.
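To make the failure mode concrete, here's an illustrative comparison (made-up completions, not real model outputs):

```python
# Illustrative completions only, not real model outputs.
base_completion = " C"
chat_completion = (
    "Let's reason through the options. (A) is ruled out because ... "
    "so the best choice is (C)."
)

def strict_first_letter(completion: str) -> str | None:
    """What a strict 'expect a bare letter' check effectively does."""
    letter = completion.strip()[:1].upper()
    return letter if letter in {"A", "B", "C", "D"} else None

print(strict_first_letter(base_completion))  # 'C'  -> scored correct
print(strict_first_letter(chat_completion))  # None -> scored wrong
```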
Another thing I noticed is that the paper uses a fixed few-shot CoT prompt (link), which can be adapted to pure few-shot by removing the reasoning steps (see the sketch below). However, it seems that `lighteval` samples few-shot prompts from the dataset, and I wonder if it makes sense to align the evaluation in both cases (zero-shot / few-shot) with the paper?

Happy to take a stab at this one if you agree!
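Here's a rough sketch of that adaptation - the field names are placeholders I made up, not the paper's actual schema:

```python
# Hypothetical few-shot example; the paper's actual CoT prompt file has
# its own schema, so treat these field names as placeholders.
example = {
    "question": "Which planet in the solar system is the largest?",
    "choices": {"A": "Earth", "B": "Jupiter", "C": "Mars", "D": "Venus"},
    "reasoning": "Jupiter's mass exceeds that of all the other planets combined.",
    "answer": "B",
}

def format_example(ex: dict, with_cot: bool) -> str:
    """Render one few-shot example, optionally dropping the reasoning steps."""
    lines = [f"Question: {ex['question']}"]
    lines += [f"({letter}) {text}" for letter, text in ex["choices"].items()]
    if with_cot:
        lines.append(ex["reasoning"])
    lines.append(f"The correct answer is ({ex['answer']})")
    return "\n".join(lines)

print(format_example(example, with_cot=False))
```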