open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] MBPP evaluator cannot extract the correct answer #1407

Open guoshengCS opened 3 months ago

guoshengCS commented 3 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

torch2.2.0+vllm-0.4.0

Reproduces the problem - code/configuration sample

Evaluate mbpp + qwen2-72b-vllm with the following config:

from mmengine.config import read_base

with read_base():
    from ...mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='qwen2-72b-vllm',
        path='Qwen/Qwen2-72B',
        model_kwargs=dict(tensor_parallel_size=4),
        max_out_len=1024,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=4),
    )
]
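As an aside, the `sum((...), [])` line in the config above is a common opencompass idiom for flattening all `*_datasets` lists found in the local scope into one list. A minimal standalone sketch (the dataset dicts here are made up for illustration):

```python
# Made-up stand-in for the module-level scope after `read_base()`:
# only the keys ending in '_datasets' hold dataset lists.
scope = {
    'mbpp_datasets': [dict(abbr='mbpp')],
    'gsm8k_datasets': [dict(abbr='gsm8k')],
    'not_a_dataset': 42,
}

# sum() with an empty-list start value concatenates the matching lists.
datasets = sum((v for k, v in scope.items() if k.endswith('_datasets')), [])
print(datasets)  # [{'abbr': 'mbpp'}, {'abbr': 'gsm8k'}]
```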

Reproduces the problem - command or script

Evaluate mbpp + qwen2-72b-vllm with the config above.

Reproduces the problem - error message

Unexpected mbpp score compared with https://qwenlm.github.io/blog/qwen2/

dataset                                 version    metric    mode    qwen2-72b-vllm
--------------------------------------  ---------  --------  ------  ----------------
Overall                                 -          -         -       -
Exam                                    -          -         -       -
Language                                -          -         -       -
Knowledge                               -          -         -       -
Understanding                           -          -         -       -
Reasoning                               -          -         -       -
--------- 考试 Exam ---------           -          -         -       -
ceval                                   -          -         -       -
agieval                                 -          -         -       -
mmlu                                    -          -         -       -
cmmlu                                   -          -         -       -
GaokaoBench                             -          -         -       -
ARC-c                                   -          -         -       -
ARC-e                                   -          -         -       -
--------- 语言 Language ---------       -          -         -       -
WiC                                     -          -         -       -
chid-dev                                -          -         -       -
afqmc-dev                               -          -         -       -
WSC                                     -          -         -       -
tydiqa-goldp                            -          -         -       -
flores_100                              -          -         -       -
--------- 知识 Knowledge ---------      -          -         -       -
BoolQ                                   -          -         -       -
commonsense_qa                          -          -         -       -
triviaqa                                -          -         -       -
nq                                      -          -         -       -
--------- 理解 Understanding ---------  -          -         -       -
C3                                      -          -         -       -
race-middle                             -          -         -       -
race-high                               -          -         -       -
openbookqa_fact                         -          -         -       -
csl_dev                                 -          -         -       -
lcsts                                   -          -         -       -
Xsum                                    -          -         -       -
eprstmt-dev                             -          -         -       -
lambada                                 -          -         -       -
--------- 推理 Reasoning ---------      -          -         -       -
cmnli                                   -          -         -       -
ocnli                                   -          -         -       -
AX_b                                    -          -         -       -
AX_g                                    -          -         -       -
RTE                                     -          -         -       -
COPA                                    -          -         -       -
ReCoRD                                  -          -         -       -
hellaswag                               -          -         -       -
piqa                                    -          -         -       -
siqa                                    -          -         -       -
math                                    -          -         -       -
mathbench-arithmetic-cloze_en           -          -         -       -
mathbench-primary-cloze_cn              -          -         -       -
gsm8k                                   -          -         -       -
drop                                    -          -         -       -
openai_humaneval                        -          -         -       -
mbpp                                    1e1056     score     gen     11.40
bbh                                     -          -         -       -

Other information

A sample mbpp prediction from qwen2-72b is as follows:

    "0": {
        "origin_prompt": "You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n\n[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n  res = tuple(set(test_tup1) & set(test_tup2))\r\n  return (res)' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n\n[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n    result = False\r\n    for i in range(2,int(math.sqrt(n)) + 1):\r\n        if n % i == 0:\r\n            result = True\r\n    return result' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n\n[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n  largest_nums = hq.nlargest(n, nums)\r\n  return largest_nums' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to remove first and last occurrence of a given character from the string. Your code should pass these tests:\n\n assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\"  \n\n[BEGIN]\n",
        "prediction": " 'def remove_Occ(str, char):\r\n  first = str.find(char)\r\n  last = str.rfind(char)\r\n  if first == -1:\r\n    return str\r\n  elif first == last:\r\n    return str[:first] + str[first+1:]\r\n  else:\r\n    return str[:first] + str[first+1:last] + str[last+1:]' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n  largest_num = max(nums)\r\n  return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your task: Write a python function to find the largest number from a given list of numbers. Your code should pass these tests:\n\n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58]) == 85 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90]) == 90 \n assert largest_number([25, 35, 22, 85, 14, 65, 75, 22, 58, 90, 100]) == 100 \n\n[BEGIN]\n 'def largest_number(nums):\r\n  largest_num = max(nums)\r\n  return largest_num' \n[DONE] \n\n \nYou are an expert Python programmer, and here is your",
        "gold": "assert remove_Occ(\"hello\",\"l\") == \"heo\"\nassert remove_Occ(\"abcda\",\"a\") == \"bcd\"\nassert remove_Occ(\"PHP\",\"P\") == \"H\""
    }

As we can see, the prediction does not start with [BEGIN], because [BEGIN] is already the ending string of the input prompt (see https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23):

mbpp_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\nassert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \nassert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n  res = tuple(set(test_tup1) & set(test_tup2))\r\n  return (res)' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \nassert is_not_prime(10) == True \nassert is_not_prime(35) == True \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n    result = False\r\n    for i in range(2,int(math.sqrt(n)) + 1):\r\n        if n % i == 0:\r\n            result = True\r\n    return result' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \nassert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n'),
                dict(role='BOT', prompt="[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n  largest_nums = hq.nlargest(n, nums)\r\n  return largest_nums' \n[DONE] \n\n "),

                dict(role='HUMAN', prompt='You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list}  \n'),
                dict(role='BOT', prompt='[BEGIN]\n'),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512),
)
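To see why the completion cannot start with [BEGIN], note that the final BOT turn above leaves the rendered prompt ending with `[BEGIN]\n`. A rough sketch of how the last round renders (plain string concatenation is an assumption here; opencompass's PromptTemplate does the real rendering):

```python
# Rough sketch of the last HUMAN + BOT rounds from the config above.
# Plain concatenation is an assumption; PromptTemplate does the real work.
human = ('You are an expert Python programmer, and here is your task: '
         '{text} Your code should pass these tests:\n\n {test_list}  \n')
bot = '[BEGIN]\n'

prompt = human.format(
    text='Write a python function to remove first and last occurrence '
         'of a given character from the string.',
    test_list='assert remove_Occ("hello","l") == "heo"',
) + bot

# The prompt itself ends with [BEGIN], so a base model's continuation
# begins directly with the code and never repeats the [BEGIN] marker.
print(prompt.endswith('[BEGIN]\n'))  # True
```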

However, MBPPEvaluator extracts answers using patterns that start with [BEGIN], so it picks up a later program among the multiple programs the base model continues to generate, rather than the first (intended) one:

    def _process_answer(self, text):
        patterns = [
            r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]",
            r"BEGIN\s*'(.*)'\s*\[DONE\]",
            r"\[BEGIN\]\s*'(.*)'\s*DONE",
            r"BEGIN\s*'(.*)'\s*DONE",
            r"\[BEGIN\]\s*'(.*)\s*\[DONE\]",
            r"BEGIN\s*'(.*)\s*\[DONE\]",
            r"\[BEGIN\]\s*'(.*)\s*DONE",
            r"BEGIN\s*'(.*)\s*DONE",
            r'\[BEGIN\]\s*(.*)\s*\[DONE\]',
            r'BEGIN\s*(.*)\s*\[DONE\]',
            r'\[BEGIN\]\s*(.*)\s*DONE',
            r'BEGIN\s*(.*)\s*DONE',
            r'```python\s*(.*)\s*```',
            r'```\s*(.*)\s*```',
            r'```python\s*(.*)\s*$',
            r'```\s*(.*)\s*$',
            r'(.*)\s*```.*',
            r"\[BEGIN\]\s*'(.*)",
            r'\[BEGIN\](.*)',
            r"'(.*)'\s*\[DONE\]",
        ]
        for p in patterns:
            match = re.search(p, text, re.DOTALL)
            if match:
                text = match.group(1)
                break
        text = text.split('```')[0]
        text = re.split(r"'?\s*\[?DONE\]?", text)[0]
        text = text.replace('\\_', '_')
        text = text.strip()
        return text
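The mismatch can be reproduced with a trimmed-down version of the extraction logic above (only a subset of the patterns, and a shortened prediction; both are simplified for illustration):

```python
import re

def process_answer(text):
    # Trimmed subset of MBPPEvaluator's patterns; the first match wins.
    patterns = [
        r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]",
        r"'(.*)'\s*\[DONE\]",
    ]
    for p in patterns:
        match = re.search(p, text, re.DOTALL)
        if match:
            text = match.group(1)
            break
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    return text.strip()

# Shortened base-model completion: the intended program comes first and
# has NO leading [BEGIN] (the prompt already ended with it); the model
# then keeps generating another few-shot round that DOES contain [BEGIN].
prediction = (
    " 'def remove_Occ(s, ch):\n  return s' \n[DONE] \n\n"
    "You are an expert Python programmer, ...\n"
    "[BEGIN]\n 'def largest_number(nums):\n  return max(nums)' \n[DONE]"
)

extracted = process_answer(prediction)
print('largest_number' in extracted)  # True: the WRONG program is kept
print('remove_Occ' in extracted)      # False: the intended one is lost
```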
tonysy commented 3 months ago

The prompt template you used has been deprecated; please try configs/datasets/mbpp/mbpp_gen_830460.py

guoshengCS commented 3 months ago

configs/datasets/mbpp/mbpp_gen_830460.py

Thanks for the quick reply! @tonysy

It seems to have the same problem, since that input prompt also ends with [BEGIN] (https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/mbpp_gen_830460.py#L23), so the response will not start with it, while MBPPEvaluator only extracts answers that start with [BEGIN].
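One possible workaround (just an idea on my side, not an official fix): since the prompt already ends with `[BEGIN]\n`, re-prepending it to the raw completion before extraction lets the first [BEGIN]...[DONE] pattern match the intended program. Sketched against a trimmed-down extractor:

```python
import re

def process_answer(text):
    # Trimmed version of MBPPEvaluator's first pattern plus its final
    # DONE-split; simplified for illustration.
    match = re.search(r"\[BEGIN\]\s*'(.*)'\s*\[DONE\]", text, re.DOTALL)
    if match:
        text = match.group(1)
    text = re.split(r"'?\s*\[?DONE\]?", text)[0]
    return text.strip()

prediction = (
    " 'def remove_Occ(s, ch):\n  return s' \n[DONE] \n\n"
    "[BEGIN]\n 'def largest_number(nums):\n  return max(nums)' \n[DONE]"
)

# Hypothetical workaround: restore the [BEGIN] the prompt ended with,
# so the first program becomes the one the pattern matches.
if not prediction.lstrip().startswith('[BEGIN]'):
    prediction = '[BEGIN]\n' + prediction

extracted = process_answer(prediction)
print('remove_Occ' in extracted)  # True: the first program is recovered
```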

tonysy commented 3 months ago

Got it. I think the prompt is designed for base models, and we may need to upgrade the prompt to be compatible with instruct models.

FlyCarrot commented 3 months ago

Got it. I think the prompt is designed for base models, and we may need to upgrade the prompt to be compatible with instruct models.

Hello, has this bug been fixed?