My random 2 cents: if we're doing few-shot too, it would be useful to see the metrics separately for each prompt style. And at that point we could just define them as separate tasks, but maybe I'm misunderstanding the use case :thinking:
The idea is that, for some evaluations, some models overfit a given prompt format, so we want to run a single evaluation across a range of prompt formats to mitigate this bias.
Ok, we're on the same page then! Currently I'm checking that by copying/generating the tasks with different prompt functions and looking at the average results, as well as the individual prompts' results, to see which ones skew them.
Hm, if you wanted to do this, the best would be to define one task per prompt function instead, I think.
Not sure how to implement prompt variation for few-shot, but I think we will need it.
We would need to entirely refactor the few-shot management. At the moment, all few-shot docs are loaded once, but here we would need to apply the formatting dynamically, for instance as sketched below. It's doable but will be a bit messy imo.
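Something like this for the context building (purely illustrative, the names are made up and not actual lighteval internals):

```python
from typing import Callable, Iterable


def build_fewshot_context(
    raw_fewshot_lines: Iterable[dict],
    prompt_fn: Callable[[dict], str],
    separator: str = "\n\n",
) -> str:
    # Re-apply the currently active prompt function to the raw few-shot lines
    # at query time, instead of reusing docs that were formatted once at load.
    return separator.join(prompt_fn(line) for line in raw_fewshot_lines)
```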
This can be tested with the following custom task.
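Roughly along these lines (a minimal sketch, assuming lighteval's custom-task API with `Doc`, `LightevalTaskConfig` and a `TASKS_TABLE`; the dataset is a placeholder and the exact field and metric names may differ between versions):

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_qa(line, task_name: str = None):
    # "Question: ... Answer:" style prompt.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=line["choices"],
        gold_index=line["gold_index"],
    )


def prompt_bare(line, task_name: str = None):
    # Bare prompt, no instruction scaffolding.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["gold_index"],
    )


# One task per prompt function, so results can be compared per prompt style
# (and averaged afterwards).
TASKS_TABLE = [
    LightevalTaskConfig(
        name=f"myqa_{style}",                # hypothetical task names
        prompt_function=prompt_fn,
        suite=["custom"],
        hf_repo="my_org/my_qa_dataset",      # placeholder dataset
        hf_subset="default",
        evaluation_splits=["test"],
        metric=[Metrics.loglikelihood_acc],  # metric spelling varies by version
    )
    for style, prompt_fn in [("qa", prompt_qa), ("bare", prompt_bare)]
]
```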
The only question is the management of few-shot: do we want to force few-shot samples to have the same format as the current question, or do we just assume that these tests will be run 0-shot?
If we force the few-shot samples to match, we'll need to reload them or change the loading system, since we fix the few-shot sample shape at creation.