open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0
4.21k stars 449 forks source link

为什么基于openai api部署的llama2-7b-chat-hf,在MMLU数据集上测试精度远低于官方数据 #1037

Closed wz0424 closed 6 months ago

wz0424 commented 7 months ago

先决条件

问题类型

我正在使用官方支持的任务/模型/数据集进行评估。

环境

torch 2.1.2 vllm 0.4.0 opencompass 0.2.3

重现问题 - 代码/配置示例

from opencompass.models import OpenAI
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask

with read_base():
    from ..summarizers.medium import summarizer
    from ..datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [
    *mmlu_datasets,
]

models = [
    dict(
        type=OpenAI,                             # 使用 OpenAI 模型
        # 以下为 `OpenAI` 初始化参数
        path='/huggingface.co/meta-llama/Llama-2-7b-chat-hf',                            # 指定模型类型
        openai_api_base = "http://localhost:8000/v1/chat/completions",
        key='-',                   # OpenAI API Key
        max_seq_len=2048,                        # 最大输入长度
        # 以下参数为各类模型都有的参数,非 `OpenAI` 的初始化参数
        abbr='Llama-2-7b',                            # 模型简称
        run_cfg=dict(num_gpus=0),                # 资源需求(不需要 GPU)
        max_out_len=512,                         # 最长生成长度
        batch_size=8,                            # 批次大小
    ),
]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalAPIRunner,
        max_num_workers=4,
        concurrent_users=4,
        task=dict(type=OpenICLInferTask)),
)

work_dir = "outputs/api_openai/"

重现问题 - 命令或脚本

python run.py /home/evllm/opencompass/configs/api_examples/eval_api_openai.py

重现问题 - 错误信息

官方数据精度有45%,实际用opencompass测下来只有26%,和乱蒙没啥太大区别,是配置文件哪里写错了还是什么原因?

image

20240411_014034 tabulate format ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ dataset version metric mode Llama-2-7b


--------- 考试 Exam --------- - - - - ceval - - - - agieval - - - - mmlu - naive_average gen 26.57 GaokaoBench - - - - ARC-c - - - - --------- 语言 Language --------- - - - - WiC - - - - summedits - - - - chid-dev - - - - afqmc-dev - - - - bustm-dev - - - - cluewsc-dev - - - - WSC - - - - winogrande - - - - flores_100 - - - - --------- 知识 Knowledge --------- - - - - BoolQ - - - - commonsense_qa - - - - nq - - - - triviaqa - - - - --------- 推理 Reasoning --------- - - - - cmnli - - - - ocnli - - - - ocnli_fc-dev - - - - AX_b - - - - AX_g - - - - CB - - - - RTE - - - - story_cloze - - - - COPA - - - - ReCoRD - - - - hellaswag - - - - piqa - - - - siqa - - - - strategyqa - - - - math - - - - gsm8k - - - - TheoremQA - - - - openai_humaneval - - - - mbpp - - - - bbh - - - - --------- 理解 Understanding --------- - - - - C3 - - - - CMRC_dev - - - - DRCD_dev - - - - MultiRC - - - - race-middle - - - - race-high - - - - openbookqa_fact - - - - csl_dev - - - - lcsts - - - - Xsum - - - - eprstmt-dev - - - - lambada - - - - tnews-dev - - - - $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------

csv format ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ dataset,version,metric,mode,Llama-2-7b --------- 考试 Exam ---------,-,-,-,- ceval,-,-,-,- agieval,-,-,-,- mmlu,-,naive_average,gen,26.57 GaokaoBench,-,-,-,- ARC-c,-,-,-,- --------- 语言 Language ---------,-,-,-,- WiC,-,-,-,- summedits,-,-,-,- chid-dev,-,-,-,- afqmc-dev,-,-,-,- bustm-dev,-,-,-,- cluewsc-dev,-,-,-,- WSC,-,-,-,- winogrande,-,-,-,- flores_100,-,-,-,- --------- 知识 Knowledge ---------,-,-,-,- BoolQ,-,-,-,- commonsense_qa,-,-,-,- nq,-,-,-,- triviaqa,-,-,-,- --------- 推理 Reasoning ---------,-,-,-,- cmnli,-,-,-,- ocnli,-,-,-,- ocnli_fc-dev,-,-,-,- AX_b,-,-,-,- AX_g,-,-,-,- CB,-,-,-,- RTE,-,-,-,- story_cloze,-,-,-,- COPA,-,-,-,- ReCoRD,-,-,-,- hellaswag,-,-,-,- piqa,-,-,-,- siqa,-,-,-,- strategyqa,-,-,-,- math,-,-,-,- gsm8k,-,-,-,- TheoremQA,-,-,-,- openai_humaneval,-,-,-,- mbpp,-,-,-,- bbh,-,-,-,- --------- 理解 Understanding ---------,-,-,-,- C3,-,-,-,- CMRC_dev,-,-,-,- DRCD_dev,-,-,-,- MultiRC,-,-,-,- race-middle,-,-,-,- race-high,-,-,-,- openbookqa_fact,-,-,-,- csl_dev,-,-,-,- lcsts,-,-,-,- Xsum,-,-,-,- eprstmt-dev,-,-,-,- lambada,-,-,-,- tnews-dev,-,-,-,- $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

-------------------------------------------------------------------------------------------------------------------------------- THIS IS A DIVIDER --------------------------------------------------------------------------------------------------------------------------------

raw format ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model: Llama-2-7b lukaemon_mmlu_college_biology: {'accuracy': 17.36111111111111} lukaemon_mmlu_college_chemistry: {'accuracy': 40.0} lukaemon_mmlu_college_computer_science: {'accuracy': 32.0} lukaemon_mmlu_college_mathematics: {'accuracy': 27.0} lukaemon_mmlu_college_physics: {'accuracy': 20.588235294117645} lukaemon_mmlu_electrical_engineering: {'accuracy': 24.82758620689655} lukaemon_mmlu_astronomy: {'accuracy': 30.92105263157895} lukaemon_mmlu_anatomy: {'accuracy': 20.74074074074074} lukaemon_mmlu_abstract_algebra: {'accuracy': 28.999999999999996} lukaemon_mmlu_machine_learning: {'accuracy': 31.25} lukaemon_mmlu_clinical_knowledge: {'accuracy': 21.88679245283019} lukaemon_mmlu_global_facts: {'accuracy': 31.0} lukaemon_mmlu_management: {'accuracy': 37.86407766990291} lukaemon_mmlu_nutrition: {'accuracy': 24.18300653594771} lukaemon_mmlu_marketing: {'accuracy': 29.059829059829063} lukaemon_mmlu_professional_accounting: {'accuracy': 23.404255319148938} lukaemon_mmlu_high_school_geography: {'accuracy': 18.68686868686869} lukaemon_mmlu_international_law: {'accuracy': 23.96694214876033} lukaemon_mmlu_moral_scenarios: {'accuracy': 26.145251396648046} lukaemon_mmlu_computer_security: {'accuracy': 28.000000000000004} lukaemon_mmlu_high_school_microeconomics: {'accuracy': 20.588235294117645} lukaemon_mmlu_professional_law: {'accuracy': 25.03259452411995} lukaemon_mmlu_medical_genetics: {'accuracy': 26.0} lukaemon_mmlu_professional_psychology: {'accuracy': 22.712418300653596} lukaemon_mmlu_jurisprudence: {'accuracy': 26.851851851851855} lukaemon_mmlu_world_religions: {'accuracy': 21.637426900584796} lukaemon_mmlu_philosophy: {'accuracy': 31.189710610932476} lukaemon_mmlu_virology: {'accuracy': 24.69879518072289} lukaemon_mmlu_high_school_chemistry: {'accuracy': 26.108374384236456} lukaemon_mmlu_public_relations: {'accuracy': 20.909090909090907} lukaemon_mmlu_high_school_macroeconomics: {'accuracy': 20.76923076923077} lukaemon_mmlu_human_sexuality: {'accuracy': 26.717557251908396} lukaemon_mmlu_elementary_mathematics: {'accuracy': 21.164021164021165} lukaemon_mmlu_high_school_physics: {'accuracy': 30.4635761589404} lukaemon_mmlu_high_school_computer_science: {'accuracy': 34.0} lukaemon_mmlu_high_school_european_history: {'accuracy': 41.81818181818181} lukaemon_mmlu_business_ethics: {'accuracy': 21.0} lukaemon_mmlu_moral_disputes: {'accuracy': 25.14450867052023} lukaemon_mmlu_high_school_statistics: {'accuracy': 43.51851851851852} lukaemon_mmlu_miscellaneous: {'accuracy': 28.735632183908045} lukaemon_mmlu_formal_logic: {'accuracy': 19.047619047619047} lukaemon_mmlu_high_school_government_and_politics: {'accuracy': 19.689119170984455} lukaemon_mmlu_prehistory: {'accuracy': 20.061728395061728} lukaemon_mmlu_security_studies: {'accuracy': 17.142857142857142} lukaemon_mmlu_high_school_biology: {'accuracy': 18.387096774193548} lukaemon_mmlu_logical_fallacies: {'accuracy': 33.12883435582822} lukaemon_mmlu_high_school_world_history: {'accuracy': 24.894514767932492} lukaemon_mmlu_professional_medicine: {'accuracy': 37.13235294117647} lukaemon_mmlu_high_school_mathematics: {'accuracy': 24.814814814814813} lukaemon_mmlu_college_medicine: {'accuracy': 20.809248554913296} lukaemon_mmlu_high_school_us_history: {'accuracy': 34.80392156862745} lukaemon_mmlu_sociology: {'accuracy': 27.363184079601986} lukaemon_mmlu_econometrics: {'accuracy': 23.684210526315788} lukaemon_mmlu_high_school_psychology: {'accuracy': 35.22935779816514} lukaemon_mmlu_human_aging: {'accuracy': 20.62780269058296} lukaemon_mmlu_us_foreign_policy: {'accuracy': 28.000000000000004} lukaemon_mmlu_conceptual_physics: {'accuracy': 32.76595744680851} mmlu-humanities: {'naive_average': 27.209468158205258} mmlu-stem: {'naive_average': 28.047951855051497} mmlu-social-science: {'naive_average': 23.45767749414954} mmlu-other: {'naive_average': 26.646291737612497} mmlu: {'naive_average': 26.57066831265621} mmlu-weighted: {'weighted_average': 26.157242558040164} $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

其他信息

No response

IcyFeather233 commented 7 months ago

I don't understand this part:

models = [
    dict(
        type=OpenAI,                             # 使用 OpenAI 模型
        # 以下为 `OpenAI` 初始化参数
        path='/huggingface.co/meta-llama/Llama-2-7b-chat-hf',                            # 指定模型类型
        openai_api_base = "http://localhost:8000/v1/chat/completions",
        key='-',                   # OpenAI API Key
        max_seq_len=2048,                        # 最大输入长度
        # 以下参数为各类模型都有的参数,非 `OpenAI` 的初始化参数
        abbr='Llama-2-7b',                            # 模型简称
        run_cfg=dict(num_gpus=0),                # 资源需求(不需要 GPU)
        max_out_len=512,                         # 最长生成长度
        batch_size=8,                            # 批次大小
    ),
]

Why you use openai api in llama? i think these are 2 totally different models. You should download llama model to use. Please take llama config example for reference.

wz0424 commented 7 months ago

I deployed the llama2 with vllm to provide an openai api externally, and I would like opencompass to be able to evaluate the model by api, rather than loading the local model directly for evaluation

IcyFeather233 commented 7 months ago

oh I see, can you share the prediction result? in your outputs/xxx/predictions/ dir

wz0424 commented 7 months ago

this is a part of the prediction { "0": { "origin_prompt": "There is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA. 0\nB. 1\nC. 2\nD. 3\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nC\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find the characteristic of the ring 2Z.\nA. 0\nB. 3\nC. 12\nD. 30\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\nA. 0\nB. 4\nC. 2\nD. 6\nAnswer: ", "prediction": "Great, let's get started! Here are the answers to the questions:\n1. Answer: B\n2. Answer: C\n3. Answer: A\n4. Answer: B\n5. Answer: A\n6. Answer: C\n\nI hope this helps! Let me know if you have any questions or need further clarification.", "gold": "B" }, "1": { "origin_prompt": "There is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA. 0\nB. 1\nC. 2\nD. 3\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nC\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find the characteristic of the ring 2Z.\nA. 0\nB. 3\nC. 12\nD. 30\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of

in S_5.\nA. 8\nB. 2\nC. 24\nD. 120\nAnswer: ", "prediction": "Great, let's get started! Here are the answers to the questions:\n1. The answer is (B) 2.\n2. The answer is (C) 10.\n3. The answer is (A) 0.\n4. The answer is (B) 2.\n5. The answer is (B) 2.\n6. The answer is (A) 1.\n7. The answer is (B) 2.\n8. The answer is (C) 24.\n\nI hope this helps! Let me know if you have any questions or need further clarification.", "gold": "C" }, "2": { "origin_prompt": "There is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA. 0\nB. 1\nC. 2\nD. 3\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nB\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nC\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find the characteristic of the ring 2Z.\nA. 0\nB. 3\nC. 12\nD. 30\nAnswer: \nA\n\nThere is a single choice question about abstract algebra. Answer the question by replying A, B, C or D.\nQuestion: Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5\nA. 0\nB. 1\nC. 0,1\nD. 0,4\nAnswer: ", "prediction": "Great, let's get started! Here are the answers to the questions:\nQuestion 1: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nAnswer: B. 1\n\nQuestion 2: Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.\nAnswer: B. False, False\n\nQuestion 3: Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.\nAnswer: C. True, False\n\nQuestion 4: Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.\nAnswer: A. True, True\n\nQuestion 5: Find the characteristic of the ring 2Z.\nAnswer: A. 0\n\nQuestion 6: Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5.\nAnswer: C. 0,1", "gold": "D" },

![Uploading image.png…]()

IcyFeather233 commented 7 months ago

It seems the problem is that your prompt template not work, here is my mmlu prediction:

    "0": {
        "origin_prompt": "<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state legislature has recently enacted a statute making it a misdemeanor to curse or revile or use obscene or opprobrious language toward or in reference to a police officer perfonning his duties. A student at a state university organized a demonstration on campus to protest the war. The rally was attended by a group of 50 students who shouted anti-war messages at cars passing by. To show his contempt for the United States, the student sewed the American flag to the rear of his jeans. When a police officer saw the flag sown on the student's jeans, he approached and told him to remove the flag or he would be placed under arrest. The student became angered and shouted at the police officer, \"Listen, you bastard, I'll wear this rag anywhere I please. \" The student was subsequently placed under arrest and charged with violating the state statute. The student subsequently brings suit in state court challenging the constitutionality of the statute. The strongest constitutional argument for the student is that\nA. the statute is void for vagueness under the Fourteenth Amendment's due process clause.\nB. the statute is invalid because it violates the petitioner's freedom of speech under the First Amendment.\nC. the statute is an abridgment of freedom of speech under the First Amendment because less restrictive means are available for achieving the same purpose.\nD. the statute is overbroad and consequently invalid under the First and FourteenthAmendments.\nAnswer: <|im_end|>\n<|im_start|>assistant\nD\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state has recently enacted a statute prohibiting the disposal of any nuclear wastes within the state. This law does not contravene or conflict with any federal statutes. A man operates a company in the state that is engaged in the disposal of nuclear wastes. Subsequent to the passage of the state statute, the man, not yet aware of the new law, entered into contracts with many out-of-state firms to dispose of their nuclear wastes in the state. On account of this new law, however, the man will be unable to perform these contracts. Assume that the man has standing to challenge this state law. Which of the following presents his strongest constitutional grounds to challenge the state law prohibiting the disposal of nuclear wastes within the state?\nA. The commerce clause.\nB. The equal protection clause of the Fourteenth Amendment.\nC. The privileges and immunities clause of Article IV, Section 2. \nD. The contract clause.\nAnswer: <|im_end|>\n<|im_start|>assistant\nA\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: Judge took judicial notice of some facts at the beginning of the trial. Which of the following is not an appropriate kind of fact for judicial notice?\nA. Indisputable facts.\nB. Facts that have been asserted by individual political organizations.\nC. Facts recognized to be true by common knowledge.\nD. Facts capable of scientific verification.\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: On October 1, 1980, a developer, owner of several hundred acres in a rural county, drafted a general development plan for the area. The duly recorded plan imposed elaborate limitations and restrictions upon the land in the plan, which was to be developed as a residential district. The restrictions were to extend to all persons acquiring any of the lots and to their heirs, assigns, and lessees. It was further provided that all subsequent owners would be charged with due notice of the restrictions. Among those restrictions in the general plan were the following:(22) A franchise right is created in a strip of land 10 feet in width along the rear of each lot for the use of public utility companies with right of ingress and egress. (23) No house or structure of any kind shall be built on the aforementioned strip of land running through the said blocks. In 2000, a retiree purchased one of the lots, built a house, and erected a fence in the rear of his property within the restricted area. In 2004, a teacher purchased a lot adjacent to the retiree's property and built a new house. Two years later, a librarian purchased the lot that adjoined the teacher's property. The three deeds to those properties each contained references to the deed book where the general plan was recorded. In 2008, the librarian began the construction of a seven-foot post-and-rail fence along the line dividing his lot with the teacher's, and along the center of the area subject to the franchise right. Although the teacher objected to its construction, the fence was completed. If the teacher seeks a mandatory injunction to compel removal of the librarian's fence, the court will most likely\nA. grant relief, because the fence was in violation of the easement restriction. \nB. grant relief, because the encroachment of the fence violated the restriction in the original plan. \nC. deny relief, because the teacher failed to enforce the restriction against the retiree. \nD. deny relief, because the fence would not be construed as \"a structure\" within the terms of the restriction. \nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: One afternoon, a pilot was flying a small airplane when it suddenly ran out of gas. As he was coming in for an emergency landing, the plane crossed into a neighboring state at a very low altitude. At this time, a 9-year-old boy was walking to school when he was struck and injured by an object, which may have fallen from the plane. In federal court, a negligence suit was brought against the pilot by the father of the boy for his son. Accompanied by his father, the boy had visited an attorney for preliminary discussions regarding the case. However, the father did not retain the attorney to represent his son in the lawsuit. Instead, the father hired another lawyer to handle the case. At trial, the pilot's attorney calls the consulting attorney to testify what the boy had said to him regarding his physical condition during the consultation that the attorney had had with the boy and his father. The attorney's testimony is\nA. admissible, because the attorney-client privilege was waived by the filing of the lawsuit.\nB. admissible, because there is no privilege of confidentiality when a person other than the client is present at the attorney-client consultation.\nC. inadmissible, because the attorney-client privilege prevents such a breach of confidential communications.\nD. inadmissible, because it was a statement of physical condition not made for the purpose of obtaining medical treatment.\nAnswer: <|im_end|>\n<|im_start|>assistant\n",
        "prediction": "C\n",
        "gold": "C"
    },
    "1": {
        "origin_prompt": "<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state legislature has recently enacted a statute making it a misdemeanor to curse or revile or use obscene or opprobrious language toward or in reference to a police officer perfonning his duties. A student at a state university organized a demonstration on campus to protest the war. The rally was attended by a group of 50 students who shouted anti-war messages at cars passing by. To show his contempt for the United States, the student sewed the American flag to the rear of his jeans. When a police officer saw the flag sown on the student's jeans, he approached and told him to remove the flag or he would be placed under arrest. The student became angered and shouted at the police officer, \"Listen, you bastard, I'll wear this rag anywhere I please. \" The student was subsequently placed under arrest and charged with violating the state statute. The student subsequently brings suit in state court challenging the constitutionality of the statute. The strongest constitutional argument for the student is that\nA. the statute is void for vagueness under the Fourteenth Amendment's due process clause.\nB. the statute is invalid because it violates the petitioner's freedom of speech under the First Amendment.\nC. the statute is an abridgment of freedom of speech under the First Amendment because less restrictive means are available for achieving the same purpose.\nD. the statute is overbroad and consequently invalid under the First and FourteenthAmendments.\nAnswer: <|im_end|>\n<|im_start|>assistant\nD\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state has recently enacted a statute prohibiting the disposal of any nuclear wastes within the state. This law does not contravene or conflict with any federal statutes. A man operates a company in the state that is engaged in the disposal of nuclear wastes. Subsequent to the passage of the state statute, the man, not yet aware of the new law, entered into contracts with many out-of-state firms to dispose of their nuclear wastes in the state. On account of this new law, however, the man will be unable to perform these contracts. Assume that the man has standing to challenge this state law. Which of the following presents his strongest constitutional grounds to challenge the state law prohibiting the disposal of nuclear wastes within the state?\nA. The commerce clause.\nB. The equal protection clause of the Fourteenth Amendment.\nC. The privileges and immunities clause of Article IV, Section 2. \nD. The contract clause.\nAnswer: <|im_end|>\n<|im_start|>assistant\nA\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: Judge took judicial notice of some facts at the beginning of the trial. Which of the following is not an appropriate kind of fact for judicial notice?\nA. Indisputable facts.\nB. Facts that have been asserted by individual political organizations.\nC. Facts recognized to be true by common knowledge.\nD. Facts capable of scientific verification.\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: On October 1, 1980, a developer, owner of several hundred acres in a rural county, drafted a general development plan for the area. The duly recorded plan imposed elaborate limitations and restrictions upon the land in the plan, which was to be developed as a residential district. The restrictions were to extend to all persons acquiring any of the lots and to their heirs, assigns, and lessees. It was further provided that all subsequent owners would be charged with due notice of the restrictions. Among those restrictions in the general plan were the following:(22) A franchise right is created in a strip of land 10 feet in width along the rear of each lot for the use of public utility companies with right of ingress and egress. (23) No house or structure of any kind shall be built on the aforementioned strip of land running through the said blocks. In 2000, a retiree purchased one of the lots, built a house, and erected a fence in the rear of his property within the restricted area. In 2004, a teacher purchased a lot adjacent to the retiree's property and built a new house. Two years later, a librarian purchased the lot that adjoined the teacher's property. The three deeds to those properties each contained references to the deed book where the general plan was recorded. In 2008, the librarian began the construction of a seven-foot post-and-rail fence along the line dividing his lot with the teacher's, and along the center of the area subject to the franchise right. Although the teacher objected to its construction, the fence was completed. If the teacher seeks a mandatory injunction to compel removal of the librarian's fence, the court will most likely\nA. grant relief, because the fence was in violation of the easement restriction. \nB. grant relief, because the encroachment of the fence violated the restriction in the original plan. \nC. deny relief, because the teacher failed to enforce the restriction against the retiree. \nD. deny relief, because the fence would not be construed as \"a structure\" within the terms of the restriction. \nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state statute provides: \"Whenever a person knows or should know that he (or she) is being arrested by a police officer, it is the duty of such person to refrain from using force or any weapon in resisting arrest. \" Violation of the statute is made punishable by fine and/or imprisonment. One morning, there was a bank robbery in the state. That afternoon, a police officer arrested a suspect who he believed was involved in the crime. However, the police officer and the suspect have given different accounts concerning what happened next. According to the police officer, after the suspect was apprehended, he resisted arrest and hit the police officer in the mouth with his fist. The police officer, who was momentarily stunned, pulled out his nightstick and struck the suspect over the head with it. On the other hand, the suspect claimed that after he was arrested, he cursed at the policeman, whereupon the police officer began hitting the suspect with his nightstick. To avoid being hit again, the suspect hit the police officer with his fist, knocking him down. The suspect was charged with assault. The suspect should be found\nA. not guilty, if the arrest was unlawful without probable cause and the jury believes the suspect's account.\nB. not guilty, if the arrest was lawful, provided that the jury believes the suspect's account.\nC. guilty, if the arrest was lawful, regardless which account the jury believes.\nD. guilty, if the arrest was unlawful, regardless which account the jury believes.\nAnswer: <|im_end|>\n<|im_start|>assistant\n",
        "prediction": "B\n",
        "gold": "B"
    },
    "2": {
        "origin_prompt": "<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state legislature has recently enacted a statute making it a misdemeanor to curse or revile or use obscene or opprobrious language toward or in reference to a police officer perfonning his duties. A student at a state university organized a demonstration on campus to protest the war. The rally was attended by a group of 50 students who shouted anti-war messages at cars passing by. To show his contempt for the United States, the student sewed the American flag to the rear of his jeans. When a police officer saw the flag sown on the student's jeans, he approached and told him to remove the flag or he would be placed under arrest. The student became angered and shouted at the police officer, \"Listen, you bastard, I'll wear this rag anywhere I please. \" The student was subsequently placed under arrest and charged with violating the state statute. The student subsequently brings suit in state court challenging the constitutionality of the statute. The strongest constitutional argument for the student is that\nA. the statute is void for vagueness under the Fourteenth Amendment's due process clause.\nB. the statute is invalid because it violates the petitioner's freedom of speech under the First Amendment.\nC. the statute is an abridgment of freedom of speech under the First Amendment because less restrictive means are available for achieving the same purpose.\nD. the statute is overbroad and consequently invalid under the First and FourteenthAmendments.\nAnswer: <|im_end|>\n<|im_start|>assistant\nD\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A state has recently enacted a statute prohibiting the disposal of any nuclear wastes within the state. This law does not contravene or conflict with any federal statutes. A man operates a company in the state that is engaged in the disposal of nuclear wastes. Subsequent to the passage of the state statute, the man, not yet aware of the new law, entered into contracts with many out-of-state firms to dispose of their nuclear wastes in the state. On account of this new law, however, the man will be unable to perform these contracts. Assume that the man has standing to challenge this state law. Which of the following presents his strongest constitutional grounds to challenge the state law prohibiting the disposal of nuclear wastes within the state?\nA. The commerce clause.\nB. The equal protection clause of the Fourteenth Amendment.\nC. The privileges and immunities clause of Article IV, Section 2. \nD. The contract clause.\nAnswer: <|im_end|>\n<|im_start|>assistant\nA\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: Judge took judicial notice of some facts at the beginning of the trial. Which of the following is not an appropriate kind of fact for judicial notice?\nA. Indisputable facts.\nB. Facts that have been asserted by individual political organizations.\nC. Facts recognized to be true by common knowledge.\nD. Facts capable of scientific verification.\nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: On October 1, 1980, a developer, owner of several hundred acres in a rural county, drafted a general development plan for the area. The duly recorded plan imposed elaborate limitations and restrictions upon the land in the plan, which was to be developed as a residential district. The restrictions were to extend to all persons acquiring any of the lots and to their heirs, assigns, and lessees. It was further provided that all subsequent owners would be charged with due notice of the restrictions. Among those restrictions in the general plan were the following:(22) A franchise right is created in a strip of land 10 feet in width along the rear of each lot for the use of public utility companies with right of ingress and egress. (23) No house or structure of any kind shall be built on the aforementioned strip of land running through the said blocks. In 2000, a retiree purchased one of the lots, built a house, and erected a fence in the rear of his property within the restricted area. In 2004, a teacher purchased a lot adjacent to the retiree's property and built a new house. Two years later, a librarian purchased the lot that adjoined the teacher's property. The three deeds to those properties each contained references to the deed book where the general plan was recorded. In 2008, the librarian began the construction of a seven-foot post-and-rail fence along the line dividing his lot with the teacher's, and along the center of the area subject to the franchise right. Although the teacher objected to its construction, the fence was completed. If the teacher seeks a mandatory injunction to compel removal of the librarian's fence, the court will most likely\nA. grant relief, because the fence was in violation of the easement restriction. \nB. grant relief, because the encroachment of the fence violated the restriction in the original plan. \nC. deny relief, because the teacher failed to enforce the restriction against the retiree. \nD. deny relief, because the fence would not be construed as \"a structure\" within the terms of the restriction. \nAnswer: <|im_end|>\n<|im_start|>assistant\nB\n<|im_end|>\n<|im_start|>user\nThere is a single choice question about professional law. Answer the question by replying A, B, C or D.\nQuestion: A taxpayer was notified by the government that her individual income tax was underpaid by $1,012. 69. The taxpayer retained an attorney to represent her in contesting the assessment. During the preparation of his client's case, the attorney suggested that it might be a good idea if the taxpayer hired an accountant to organize her records and prepare a financial statement for the year in question. Following the attorney's advice, the taxpayer consulted an accountant, an old family friend, who prepared a financial statement, which the attorney referred to at trial. During the trial, the attorney representing the government called the accountant to testif' about statements that the taxpayer made to him. The accountant's proposed testimony is\nA. inadmissible, because it would violate the attorney-client privilege.\nB. inadmissible, because it would violate the taxpayer's privilege against self-incrimination.\nC. inadmissible as violative of the work-product rule.\nD. admissible as an admission.\nAnswer: <|im_end|>\n<|im_start|>assistant\n",
        "prediction": "A\n",
        "gold": "D"
    },

As you see, the output is only ABCD options

wz0424 commented 7 months ago

Do you know why this problem occurs? I observed the configuration file and there seems to be no problem with the prompt template, but it just does not take effect. The data set config I used is mmlu_gen_4d595a.py, and the output config is as follows:

dict(abbr='lukaemon_mmlu_college_biology', eval_cfg=dict( evaluator=dict( type='opencompass.openicl.icl_evaluator.AccEvaluator'), pred_postprocessor=dict( options='ABCD', type='opencompass.utils.text_postprocessors.first_option_postprocess')), infer_cfg=dict( ice_template=dict( template=dict( round=[ dict(prompt='There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer: ', role='HUMAN'), dict(prompt='{target}\n', role='BOT'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate'), inferencer=dict( type='opencompass.openicl.icl_inferencer.GenInferencer'), prompt_template=dict( ice_token='', template=dict( begin='', round=[ dict(prompt='There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer: ', role='HUMAN'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate'), retriever=dict( fix_id_list=[ 0, 1, 2, 3, 4, ], type='opencompass.openicl.icl_retriever.FixKRetriever')), name='college_biology', path='./data/mmlu/', reader_cfg=dict( input_columns=[ 'input', 'A', 'B', 'C', 'D', ], output_column='target', train_split='dev'), type='opencompass.datasets.MMLUDataset'),

IcyFeather233 commented 7 months ago

Maybe you need some tools to check, try https://github.com/open-compass/opencompass/blob/main/tools/prompt_viewer.py to see the prompt feed to your model? (Actually I don't know how to check prompt, but this tools looks like that it can be tried from its name, I am just a user and not a developer)

linxxx3 commented 6 months ago

Do you know why this problem occurs? I observed the configuration file and there seems to be no problem with the prompt template, but it just does not take effect. The data set config I used is mmlu_gen_4d595a.py, and the output config is as follows:

dict(abbr='lukaemon_mmlu_college_biology', eval_cfg=dict( evaluator=dict( type='opencompass.openicl.icl_evaluator.AccEvaluator'), pred_postprocessor=dict( options='ABCD', type='opencompass.utils.text_postprocessors.first_option_postprocess')), infer_cfg=dict( ice_template=dict( template=dict( round=[ dict(prompt='There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer: ', role='HUMAN'), dict(prompt='{target}\n', role='BOT'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate'), inferencer=dict( type='opencompass.openicl.icl_inferencer.GenInferencer'), prompt_template=dict( ice_token='', template=dict( begin='', round=[ dict(prompt='There is a single choice question about college biology. Answer the question by replying A, B, C or D.\nQuestion: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer: ', role='HUMAN'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate'), retriever=dict( fix_id_list=[ 0, 1, 2, 3, 4, ], type='opencompass.openicl.icl_retriever.FixKRetriever')), name='college_biology', path='./data/mmlu/', reader_cfg=dict( input_columns=[ 'input', 'A', 'B', 'C', 'D', ], output_column='target', train_split='dev'), type='opencompass.datasets.MMLUDataset'),

because you are testing a "chat" model, which needs a specific "chat template" to work. not just the prompt template of the evalution. from the previous comments, you could figure out the "<|im_start|>" things, that's important. you are using the chat-completion api from openai, but probably not applied the correct chat template when configuring the inference service. but the "<|im_start|>" token which contains in the "chatml" format, is not used by llama chat model, as far as i known. llama chat model has another "chat template".

zjintheroom commented 4 months ago

您知道为什么会出现此问题吗?我观察了配置文件,提示模板似乎没有问题,但就是没有生效。我使用的数据集配置是mmlu_gen_4d595a.py,输出配置如下: dict(abbr='lukaemon_mmlu_college_biology', eval_cfg=dict( evaluator=dict( type='opencompass.openicl.icl_evaluator.AccEvaluator'), pred_postprocessor=dict( options='ABCD', type='opencompass.utils.text_postprocessors.first_option_postprocess')), infer_cfg=dict( ice_template=dict( template=dict( round=[ dict(prompt='有一个关于大学生物学的单选择题。通过回答 A、B、C 或 D 来回答问题。\n问题:{input}\nA。{A}\nB.{B}\nC.{C}\nD.{D}\nAnswer: ', role='HUMAN'), dict(prompt='{target}\n', role='BOT'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate')、inferencer=dict( type='opencompass.openicl.icl_inferencer。GenInferencer'), prompt_template=dict( ice_token='', template=dict( begin='', round=[ dict(prompt='有一个关于大学生物学的单选题。通过回答 A、B、C 或 D 来回答问题。\n问题:{input}\nA。{A}\nB.{B}\nC.{C}\nD.{D}\nAnswer: ', role='HUMAN'), ]), type='opencompass.openicl.icl_prompt_template.PromptTemplate'), retriever=dict( fix_id_list=[ 0, 1, 2, 3, 4, ], type='opencompass.openicl.icl_retriever.FixKRetriever')), name='college_biology', path='./data/mmlu/', reader_cfg=dict( input_columns=[ 'input', 'A', 'B', 'C', 'D', ], output_column='target', train_split='dev'), type='opencompass.datasets.MMLUDataset'),

因为你正在测试一个“聊天”模型,它需要一个特定的“聊天模板”才能工作。而不仅仅是评估的提示模板。从前面的评论中,您可以弄清楚“<|im_start|>”的事情,这很重要。您正在使用 OpenAI 的聊天完成 API,但在配置推理服务时可能没有应用正确的聊天模板。但是据我所知,包含在“ChatML”格式中的“<|im_start|>”令牌并未被 LLAMA 聊天模型使用。Llama 聊天模型还有另一个“聊天模板”。

想问一下,使用vllm起的openai的借口,并且起的模型中tokenizer的chat template参数已经定义好了。那么在评测中,还需要加 prompt模版么