huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

Zero scores on cnn-dm benchmark from HELM #188

Closed · hicleo closed this issue 4 months ago

hicleo commented 4 months ago

When running evaluations of Sheared-LLaMA-1.3B and the original LLaMA-7B on helm|summarization:cnn-dm, I get zero scores across all metrics:

accelerate launch --multi_gpu --num_processes=3 run_evals_accelerate.py --model_args "pretrained=princeton-nlp/Sheared-LLaMA-1.3B,model_parallel=True" --task "helm|summarization:cnn-dm|0|0" --override_batch_size 1 --output_dir "./evals/"

Output:

|           Task            |Version|         Metric          |Value|   |Stderr|
|---------------------------|------:|-------------------------|----:|---|-----:|
|all                        |       |rouge1                   |    0|±  |     0|
|                           |       |rouge2                   |    0|±  |     0|
|                           |       |rougeL                   |    0|±  |     0|
|                           |       |summac                   |    0|±  |     0|
|                           |       |summarization_coverage   |    0|±  |     0|
|                           |       |summarization_density    |    0|±  |     0|
|                           |       |summarization_compression|    0|±  |     0|
|helm:summarization:cnn-dm:0|      0|rouge1                   |    0|±  |     0|
|                           |       |rouge2                   |    0|±  |     0|
|                           |       |rougeL                   |    0|±  |     0|
|                           |       |summac                   |    0|±  |     0|
|                           |       |summarization_coverage   |    0|±  |     0|
|                           |       |summarization_density    |    0|±  |     0|
|                           |       |summarization_compression|    0|±  |     0|
hicleo commented 4 months ago

This seems to have something to do with task_prompt_formatting. I changed the prompt as follows:

instruction="### Instruction: Summarize the following passage in 3 sentences.\n", 
query=f"### Instruction: Summarize the following passage in 3 sentences.\n### Passage: {line['article']}\n### Summary: ",

With this change, the issue is fixed.
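
For context, here is a minimal sketch of what the patched prompt function might look like, modeled on the prompt functions in lighteval's tasks_prompt_formatting.py. The function name, the Doc fields, and the line["summary"] gold column are assumptions based on the snippet above, not a verified copy of the repository code:

```python
# Hypothetical sketch of the patched prompt function. Field names such as
# line["article"] / line["summary"] are assumptions inferred from the
# snippet above; check the actual task definition in your checkout.
from lighteval.tasks.requests import Doc


def cnn_dm(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        instruction="### Instruction: Summarize the following passage in 3 sentences.\n",
        # The instruction is repeated inside the query so the model sees it
        # immediately before the passage, which is what fixed the zero scores.
        query=(
            "### Instruction: Summarize the following passage in 3 sentences.\n"
            f"### Passage: {line['article']}\n"
            "### Summary: "
        ),
        choices=[str(line["summary"])],  # gold summary column name is an assumption
        gold_index=0,
    )
```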

clefourrier commented 4 months ago

Thanks so much for debugging! Would you be OK with opening a PR to share this fix with the community?

hicleo commented 3 months ago

I'm not sure whether this is related to the evaluated model itself. When I use another fine-tuned model, the original code seems fine. We may need to adjust the task_prompt_formatting ourselves to match each model's requirements.
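
One way to do that without patching the library is a custom task file, assuming your lighteval version supports the --custom_tasks flag with a TASKS_TABLE (as in community_tasks/_template.py). Everything below (config parameters, metric names, dataset columns) is illustrative, so compare against the template shipped with your version:

```python
# custom_cnndm.py -- hypothetical per-model prompt override for cnn-dm.
# All parameter values are illustrative assumptions; compare with
# community_tasks/_template.py in your lighteval checkout.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def cnn_dm_instruct(line, task_name: str = None):
    # Same dataset, different prompt wording tuned to the evaluated model.
    return Doc(
        task_name=task_name,
        query=f"Article: {line['article']}\n\nSummarize the article above in 3 sentences.\n",
        choices=[str(line["highlights"])],  # "highlights" is the cnn_dailymail gold column
        gold_index=0,
    )


# lighteval picks up this table when the file is passed via --custom_tasks.
TASKS_TABLE = [
    LightevalTaskConfig(
        name="summarization:cnn-dm-instruct",
        prompt_function="cnn_dm_instruct",
        suite=["custom"],
        hf_repo="cnn_dailymail",
        hf_subset="3.0.0",
        evaluation_splits=["test"],
        metric=["rouge1", "rouge2", "rougeL"],
        generation_size=128,
        stop_sequence=["\n"],
    )
]
```

The task could then be run with --task "custom|summarization:cnn-dm-instruct|0|0" --custom_tasks ./custom_cnndm.py, keeping the built-in HELM task untouched.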