My random 2 cents: if we're doing few-shot too, it would be useful to see the metrics separately for each prompt style. And at that point we could just define them as separate tasks, but maybe I'm misunderstanding the use case :thinking:
The idea is that, for some evaluations, some models overfit a given prompt format, so we want to run a single evaluation across a range of prompt formats to mitigate this bias.
Ok, we're on the same page then! Currently I'm checking that by copying/generating the tasks with different prompt functions and looking at the average results, as well as the individual prompts' results, to see which ones skew them.
Hm, if you wanted to do this, the best would be to define one task per prompt function instead, I think.
Not sure how to implement prompt variation for few-shot, but I think we will need it.
We would need to entirely refactor the few-shot management. At the moment, all few-shot docs are loaded once, but here we would need to apply the formatting dynamically, for instance as sketched below. It's doable but will be a bit messy imo.
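Something like this for the context building (purely illustrative, the names are made up and not actual lighteval internals):

```python
from typing import Callable, Iterable


def build_fewshot_context(
    raw_fewshot_lines: Iterable[dict],
    prompt_fn: Callable[[dict], str],
    separator: str = "\n\n",
) -> str:
    # Re-apply the currently active prompt function to the raw few-shot lines
    # at query time, instead of reusing docs that were formatted once at load.
    return separator.join(prompt_fn(line) for line in raw_fewshot_lines)
```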
This can be tested with the following custom task.
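Roughly along these lines (a minimal sketch, assuming lighteval's custom-task API with `Doc`, `LightevalTaskConfig` and a `TASKS_TABLE`; the dataset is a placeholder and the exact field and metric names may differ between versions):

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_qa(line, task_name: str = None):
    # "Question: ... Answer:" style prompt.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=line["choices"],
        gold_index=line["gold_index"],
    )


def prompt_bare(line, task_name: str = None):
    # Bare prompt, no instruction scaffolding.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["gold_index"],
    )


# One task per prompt function, so results can be compared per prompt style
# (and averaged afterwards).
TASKS_TABLE = [
    LightevalTaskConfig(
        name=f"myqa_{style}",                # hypothetical task names
        prompt_function=prompt_fn,
        suite=["custom"],
        hf_repo="my_org/my_qa_dataset",      # placeholder dataset
        hf_subset="default",
        evaluation_splits=["test"],
        metric=[Metrics.loglikelihood_acc],  # metric spelling varies by version
    )
    for style, prompt_fn in [("qa", prompt_qa), ("bare", prompt_bare)]
]
```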
The only question is the management of few-shot: do we want to force few-shot samples to have the same format as the current question, or do we just assume that these tests will be run 0-shot?
If we force the few-shot samples to match, we'll need to reload them or change the loading system, since we fix the few-shot sample shape at creation.