mtbench101 / mt-bench-101

[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Apache License 2.0

[Bug] During evaluation, {prediction} in origin_prompt is not replaced with the model's response? #8

Closed liuyaox closed 2 weeks ago

liuyaox commented 3 weeks ago

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
 'GPU 0,1': 'NVIDIA A800-SXM4-80GB',
 'MMEngine': '0.10.4',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.10.0',
 'PyTorch': '2.3.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2023.1-Product Build 20230303 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.3.6 (Git Hash '
                              '86e6af5974177e513fd3fee58425e1063e7f1361)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]',
 'TorchVision': '0.18.1',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.5+',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

# Imports required by this snippet (mirroring OpenCompass's official
# eval_subjective_mtbench101.py config, added here for self-containedness):
from mmengine.config import read_base
from opencompass.models import OpenAI, VLLM
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import MTBench101Summarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

with read_base():
    from ..configs.datasets.subjective.multiround.mtbench101_judge import subjective_datasets

api_meta_template = dict(
    round=[
        dict(role='SYSTEM', api_role='SYSTEM'),
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)
qwen_meta_template = dict(
    round=[
        dict(role="HUMAN", begin='<|im_start|>user\n', end='<|im_end|>\n'),
        dict(role="BOT", begin="<|im_start|>assistant\n", end='<|im_end|>\n', generate=True),
    ],
    eos_token_id=151645,
)
GPU_NUMS = 2

v1_6_path = "xxx"
models = [
    dict(
        abbr='ja_v1_6',
        type=VLLM,
        path=v1_6_path,
        model_kwargs=dict(tensor_parallel_size=GPU_NUMS),
        meta_template=qwen_meta_template,
        max_seq_len=4096,
        max_out_len=4096,
        batch_size=4,
        stop_words=['<|im_end|>'],
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]
datasets = [x for x in subjective_datasets if x['abbr'] in ['mtbench101_test', 'mtbench101_ja_test']]

judge_models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4-1106-preview',
    key='sk-xxx',
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=4096,
    max_seq_len=4096,
    batch_size=8,
    temperature=0.8,
)]

eval = dict(
    partitioner=dict(type=SubjectiveSizePartitioner, max_task_size=100000, mode='singlescore', models=models, judge_models=judge_models),
    runner=dict(type=LocalRunner, max_num_workers=32, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=MTBench101Summarizer, judge_type='single')

work_dir = 'outputs/mtbench101/'

Reproduces the problem - command or script

python run.py xxx/eval_subjective_mtbench101.py --debug

Reproduces the problem - error message

This is the JSON file under the results directory after evaluation. Note that there are 2 cases. The prediction for the 2nd case says The assistant's response is marked as \"{prediction}\" which indicates that there is no actual response provided in the dialogue to evaluate, meaning that the {prediction} in the HUMAN turn of origin_prompt was never filled in with the model's answer. Strangely, the 1st case's prediction was scored normally, receiving the top score of 10.

I have two questions:

  1. In both cases, the HUMAN turn of origin_prompt contains a {prediction} that was not replaced with the model's actual answer. Is this normal?
  2. The 1st case was scored normally, while the 2nd case says prediction was not filled in and therefore cannot be scored. Why the difference?
{
    "2": {
        "origin_prompt": [
            {
                "role": "SYSTEM",
                "prompt": "Please act as an impartial judge ...... Note that only the response of the 'assistant' in the LAST ROUND of conversations is the output of the large language model (the AI assistant) that needs to be evaluated!! You must provide your explanation. After providing your explanation, please show the score by strictly following this format: 'Rating: [[score]]', for example 'Rating: [[6]]'. The DIALGUE need to be judged is in this format: \n *** \n DIALGUE \n ***\n"
            },
            {
                "role": "HUMAN",
                "prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: 現在、A、B、C の 3 人がいます。現在、A は B より背が高く、B は C より背が高いことがわかっています。現在、最も背が高いのは誰でしょうか。\n\nAssistant: 与えられた情報に基づくと、A は 3 人の中で最も背が高いです。\n\n Human: 今はあと 2 人、D と E がいます。D は B より背が高く、E は D より背が高いです。現時点で一番背が高いのは誰ですか?\n\nAssistant: 追加情報に基づいて、現時点で最も背の高い人物が誰であるかを判断することはできません。提供された情報には、個人間の相対的な身長が記載されているだけで、A、C、D、E 間の具体的な比較は示されていません。\n\n Human: さて、D が A より高いことはわかっています。現在、最も高いのは誰ですか?\n\nAssistant:  {prediction} \n ***\n\n                    The reference solution is: \n ### xxx(省略) ###\n\n"
            }
        ],
        "prediction": "The assistant's final response is correct. The assistant provided a clear logic for determining who is the tallest among the individuals in question. The response builds on the updated information that D is taller than A, and since A was initially the tallest among A, B, and C, and E is taller than D, E is concluded to be the tallest.\n\nThe reasoning is sound and follows the principles of logic. The assistant correctly integrated the historical dialogue information provided by the user, which is essential for determining the correct answer.\n\nThe assistant's communication is understandable and instructive. It uses the information provided in the conversation effectively to reach the correct conclusion. It also clearly explains the reasoning process, which aids the user's comprehension of how the answer was derived.\n\nRating: [[10]]",
        "gold": {xxx }
    },
    "3": {
        "origin_prompt": [
            {
                "role": "SYSTEM",
                "prompt": "Judge Prompt,同上"
            },
            {
                "role": "HUMAN",
                "prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: ビタミンを摂取すると健康になりますか?\n\nAssistant:  {prediction} \n ***"
            }
        ],
        "prediction": "The assistant's response is marked as \"{prediction}\" which indicates that there is no actual response provided in the dialogue to evaluate. Without a specific response from the assistant, it is impossible to determine whether the assistant would have appropriately used a rhetorical question to clarify the human's query about vitamins and health, or whether it would have provided a detailed and accurate answer. Since there is no actual response from the AI assistant to assess, it cannot be scored based on the given criteria.\n\nRating: [[1]]",
        "gold": {xxx}
    }
}

Other information

Original issue: https://github.com/open-compass/opencompass/issues/1271


sefira commented 3 weeks ago

Thanks for your interest in MT-Bench-101!

Regarding your two questions:

In both cases, the HUMAN turn of origin_prompt contains a {prediction} that was not replaced with the model's actual answer. Is this normal?

This is not normal; all answers should be filled in. I pulled the latest OpenCompass code and tested it here, and found no anomaly. My test procedure: set both models and judge_models to the simplest GPT-4 API config, pick two samples at random from the original data for a quick test, and run python run.py configs/eval_subjective_mtbench101.py --debug

# Imports required by this snippet (mirroring the same official config;
# api_meta_template is the one defined earlier in this thread):
from opencompass.models import OpenAI
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4o-2024-05-13', # To compare with the official leaderboard, please use gpt-4-1106-preview
    key='',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=4096,
    max_seq_len=4096,
    batch_size=1,
    temperature=0.8,
)]
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=10000),
    runner=dict(
        type=LocalRunner,
        partition='llm_dev2',
        quotatype='auto',
        max_num_workers=32,
        task=dict(type=OpenICLInferTask),
    ),
)
judge_models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4o-2024-05-13', # To compare with the official leaderboard, please use gpt-4-1106-preview
    key='',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=4096,
    max_seq_len=4096,
    batch_size=8,
    temperature=0.8,
)]

The result is:

{
    "0": {
        "origin_prompt": [
            {
                "role": "SYSTEM",
                "prompt": "Please act as an impartial judge follow this instructions: xxx"
            },
            {
                "role": "HUMAN",
                "prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant:  Given that A is taller than B and B is taller than C, we can infer the following order of height:\n\n- A is taller than B.\n- B is taller than C.\n\nFrom this information, it is clear that A is the tallest among the three. \n ***\n\n                    The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}] \n ###\n\n"
            }
        ],
        "prediction": "The assistant's response is evaluated according to the given criteria:\n\n1. **Accuracy of Answer**: The assistant correctly identifies A as the tallest based on the provided information, which matches the reference solution. \n\n2. **Completeness and Clarity of Reasoning**: The assistant provides a step-by-step explanation, clearly stating that A is taller than B and B is taller than C, leading to the conclusion that A is the tallest. \n\n3. **Integration of Historical Dialogue**: The assistant effectively uses the historical dialogue to derive the correct answer.\n\n4. **Communication and Instruction**: The assistant communicates the solution in a clear and instructive manner, making it easy for the human to understand and follow the logic.\n\nOverall, the assistant's response is accurate, logically structured, and well-communicated.\n\nRating: [[10]]",
        "gold": {
            "multi_id": 1,
            "task": "GR",
            "turn_id": "1"
        }
    },
    "1": {
        "origin_prompt": [
            {
                "role": "SYSTEM",
                "prompt": "Please act as an impartial judge follow this instructions: xxx"
            },
            {
                "role": "HUMAN",
                "prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant: Based on the given information, A is the tallest among the three people.\n\n Human: Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?\n\nAssistant:  Given the updated information:\n- A is taller than B.\n- B is taller than C.\n- D is taller than B.\n- E is taller than D.\n\nTo determine the tallest among A, B, C, D, and E, we need to compare A and E, since they are the tallest in their respective sequences.\n\nHowever, we don't have any direct information comparing A and E. Based on the given information alone, we cannot definitively say who is the tallest between A and E.\n\nIf no additional information is provided, we can only conclude that either A or E is the tallest. \n ***\n\n                    The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}, {'role': 'user', 'content': 'Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?'}, {'role': 'assistant', 'content': 'Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.'}] \n ###\n\n"
            }
        ],
        "prediction": "The assistant's response correctly identifies that there is insufficient information to determine definitively who the tallest person is between A and E. It provides a logical and clear explanation for this conclusion by comparing A and E and recognizing that the relative heights between the sequences cannot be determined with the given information.\n\n### Evaluation Criteria:\n\n1. **Accuracy of the Answer:** The assistant's answer aligns with the reference solution in accurately stating that it is impossible to determine the tallest person definitively.\n2. **Reasoning Process:** The assistant provides a sound reasoning process by comparing the heights of individuals within both sequences and acknowledging the lack of direct information between A and E.\n3. **Integration of Historical Dialogue:** The assistant effectively integrates the historical dialogue and considers all the previous information provided.\n4. **Communication:** The explanation is clear, logical, and understandable, which aids in the human’s comprehension of the situation.\n\nGiven the assistant's thoroughness in addressing the problem and the clarity of its reasoning, it meets the criteria very well.\n\nRating: [[10]]",
        "gold": {
            "multi_id": 1,
            "task": "GR",
            "turn_id": "2"
        }
    },
    "2": {
        "origin_prompt": [
            {
                "role": "SYSTEM",
                "prompt": "Please act as an impartial judge follow this instructions: xxx"
            },
            {
                "role": "HUMAN",
                "prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant: Based on the given information, A is the tallest among the three people.\n\n Human: Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?\n\nAssistant: Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.\n\n Human: Now, I know that D is higher than A. Who is the highest now?\n\nAssistant:  Given the new piece of information that D is taller than A, and previously knowing that E is taller than D, E is the tallest among all the individuals (A, B, C, D, and E). \n ***\n\n                    The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}, {'role': 'user', 'content': 'Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?'}, {'role': 'assistant', 'content': 'Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.'}, {'role': 'user', 'content': 'Now, I know that D is higher than A. Who is the highest now?'}, {'role': 'assistant', 'content': 'Based on the updated information, if D is higher than A and A is already known to be taller than B and B is taller than C, E is higher than D. then E would be the tallest person at the moment.'}] \n ###\n\n"
            }
        ],
        "prediction": "The AI assistant's response accurately identifies that E is the tallest person based on the newly provided information that D is taller than A, and E is taller than D. This matches the reference solution.\n\n1. **Accuracy**: The response correctly concludes that E is the tallest. This is consistent with the reference solution.\n2. **Reasoning Process**: The AI assistant provides a clear and logical reasoning process:\n   - It restates the new information that D is taller than A.\n   - It uses the previous information that E is taller than D to conclude that E is the tallest.\n3. **Integration of Historical Dialogue**: The AI assistant integrates the historical information correctly, considering all given height comparisons.\n4. **Communication**: The explanation is clear and understandable, effectively communicating the solution.\n\nOverall, the assistant's response meets all the evaluation criteria effectively.\n\nRating: [[10]]",
        "gold": {
            "multi_id": 1,
            "task": "GR",
            "turn_id": "3"
        }
    },

The Human turns are filled in correctly, and the evaluation of prediction works as expected.

The 1st case was scored normally, while the 2nd case says prediction was not filled in and therefore cannot be scored. Why the difference?

In both the 1st and the 2nd case, the HUMAN turn contains one unfilled {prediction}; both cases are abnormal. The only difference is that the 1st case got through a few dialogue rounds before the fill failed, while the 2nd case failed in the very first round, so GPT-4's refusal behavior differs slightly. But this is not the main problem; you should first solve question 1, i.e. why the placeholder is not being filled.

Conclusion

Since I pulled the latest code here and found no anomaly in testing, please double-check whether something in your usage is off. Also, your model answers in Japanese, which seems a bit odd: could special characters have been introduced that cause the fill to fail?
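
For illustration only, a minimal sketch of how such a failure could occur if the final fill uses Python's str.format (an assumption about the mechanism; the strings below are made up, not taken from OpenCompass):

# Hypothetical judge prompt whose history already contains a model answer
# with literal braces (e.g. code or JSON inside a Japanese response):
prompt = ('The dialogue need to be judged is:\n'
          'Human: 質問です\n'
          'Assistant: コードは {foo} です\n\n'  # braces from an earlier answer
          'Assistant: {prediction}')

try:
    # If the second fill is done with str.format, the stray {foo} field
    # makes the call raise before {prediction} is ever substituted:
    prompt.format(prediction='モデルの回答')
except KeyError as err:
    print('format() failed on a stray brace field:', err)  # KeyError: 'foo'

If an error like this were swallowed upstream, the raw {prediction} would survive into the judge prompt, which matches the symptom reported above.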

liuyaox commented 2 weeks ago

Hi, thanks for the reply!

Looking at the code (https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/subjective/mtbench101.py#L245),

format(history=history, prediction='{prediction}')

history is indeed filled in, but prediction is filled with the literal string '{prediction}', which amounts to not filling it at all. So there must be a second substitution later. Where does that second fill happen? I have searched for a long time without finding it. Many thanks!
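
To make the two-stage substitution concrete, here is a minimal sketch (the template string is hypothetical, not the actual MT-Bench-101 judge prompt):

template = 'Dialogue: {history}\n\nAssistant: {prediction}'

# Stage 1 (dataset construction, the line cited above): fill {history} and
# deliberately re-emit a literal '{prediction}' placeholder for later use.
stage1 = template.format(history='Human: hello', prediction='{prediction}')
print(stage1)  # ... Assistant: {prediction}

# Stage 2 (the "second fill" being asked about, presumably somewhere in the
# inference/eval phase): substitute the model's actual answer.
stage2 = stage1.format(prediction='Hi, how can I help?')
print(stage2)  # ... Assistant: Hi, how can I help?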

sefira commented 2 weeks ago

Hi, thanks for the reply!

Looking at the code (https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/subjective/mtbench101.py#L245),

format(history=history, prediction='{prediction}')

history is indeed filled in, but prediction is filled with the literal string '{prediction}', which amounts to not filling it at all. So there must be a second substitution later. Where does that second fill happen? I have searched for a long time without finding it. Many thanks!

Please note that this code only constructs the Dataset class that is used later, just as OpenCompass does for MTBench (https://github.com/open-compass/opencompass/blob/842fb1cd7018151e234e471946b62aa7a2e4d26e/opencompass/datasets/subjective/mtbench.py#L106 and https://github.com/open-compass/opencompass/blob/842fb1cd7018151e234e471946b62aa7a2e4d26e/opencompass/datasets/subjective/mtbench.py#L177). You can check the code in the inference and eval phases, or ask the OpenCompass maintainers.

Thank you.