open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0
4.2k stars, 449 forks

[Bug] About the output length of MT-Bench #1307

Closed (MrZhengXin closed this issue 4 months ago)

MrZhengXin commented 4 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0',
 'GPU 0,1,2,3': 'NVIDIA H800',
 'MMEngine': '0.10.4',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.9.0',
 'PyTorch': '2.3.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.3.6 (Git Hash '
                              '86e6af5974177e513fd3fee58425e1063e7f1361)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
 'TorchVision': '0.18.0+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.5+4b451d1',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

from mmengine.config import read_base

with read_base():
    from .datasets.subjective.multiround.mtbench_single_judge_diff_temp import subjective_datasets

from opencompass.models import VLLMwithChatTemplate, HuggingFacewithChatTemplate, HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import MTBenchSummarizer

api_meta_template = dict(
    round=[
        dict(role='SYSTEM', api_role='SYSTEM'),
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# -------------Inference Stage ----------------------------------------
# For subjective evaluation, sampling (do_sample) is often enabled for the models
models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2-72b-instruct-vllm',
        path='Qwen/Qwen2-72B-Instruct',
        model_kwargs=dict(tensor_parallel_size=4),
        max_seq_len=8192,
        batch_size=8,
        run_cfg=dict(num_gpus=4),
    )
]

datasets = [*subjective_datasets]

# -------------Evaluation Stage ----------------------------------------

## ------------- JudgeLLM Configuration
judge_models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4-0613',  # To compare with the official leaderboard, please use gpt-4-0613
    key='xxxx',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=2048,
    max_seq_len=8192,
    batch_size=8,
    temperature=0,
)]

## single evaluation
eval = dict(
    partitioner=dict(type=SubjectiveSizePartitioner, strategy='split', max_task_size=10000, mode='singlescore', models=models, judge_models=judge_models),
    runner=dict(type=LocalRunner, max_num_workers=32, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=MTBenchSummarizer, judge_type='single')

work_dir = 'outputs/mtbench/'

Reproduces the problem - command or script

python3 run.py configs/eval_subjective_mtbench.py -r 

Reproduces the problem - error message

Incomplete generation of Qwen2-72B-Instruct.

[
    "25": {
        "prediction": [
            {
                "round": 1,
                "user": "Implement a function to find the median of two sorted arrays of different sizes with O(1) space complexity and O(n) time complexity.",
                "assistant": "To find the median of two sorted arrays of different sizes, we can use a modified version of the merge algorithm used in merge sort. The idea is to merge the two arrays until we reach the middle element(s) in O(n) time complexity and O(1) space complexity.\n\nHere's the implementation in Python:\n\n```python\ndef findMedianSortedArrays(nums1, nums2):\n    # Ensure nums1 is the smaller array\n    if len(nums1) > len(nums2):\n        return findMedianSortedArrays(nums2, nums1)\n    \n    x, y = len(nums1), len(nums2)\n    low, high = 0, x\n    \n    while low <= high:\n        partitionX = (low + high) // 2\n        partitionY = (x + y + 1) // 2 - partitionX\n        \n        maxLeftX = float('-inf') if partitionX == 0 else nums1[partitionX - 1]\n        minRightX = float('inf') if partitionX == x else nums1[partitionX]\n        \n        maxLeftY = float('-inf') if partitionY == 0 else nums2[partitionY - 1]\n        minRightY = float('inf') if partitionY == y else nums2[partitionY]\n        \n        if maxLeftX <= minRightY and maxLeftY <= minRightX:\n            if (x + y) % 2 == 0:\n                return (max(maxLeftX, maxLeftY) + min(minRightX, minRightY)) / 2\n            else:\n                return max(maxLeftX, maxLeftY)\n        elif maxLeftX > minRightY:\n            high = partitionX - 1\n        else:\n            low = partitionX + 1\n```\n\nThis implementation uses binary search to find the correct partition in the smaller array, which ensures O(log(min(n, m))) time complexity, where n and m are the sizes of the two arrays. 
However, since the binary search is performed on the smaller array and the merging is done implicitly, the overall time complexity remains O(n).\n\nThe space complexity is O(1) as we only use a constant amount of extra space for variables.\n\nNote that the implementation above assumes that both arrays are non-empty.\n\nHere's an example usage:\n\n```python\nnums1 = [1, 3, 5]\nnums2 = [2, 4, 6, 8]\nmedian = findMedian"
            },
            {
                "round": 2,
                "user": "Does there exist an implementation with better time complexity?",
                "assistant": "Actually, the time complexity of the provided algorithm is O(log(min(m, n))), where m and n are the lengths of the two arrays. This is because we're using a binary search algorithm, which has logarithmic time complexity.\n\nTo answer your question, this is the most efficient algorithm for finding the median of two sorted arrays in terms of time complexity. It is not possible to find the median of two sorted arrays in less than O(log(min(m, n))) time, which makes this algorithm optimal.\n\nHere is the optimal algorithm:\n\n```python\ndef findMedianSortedArrays(nums1, nums2):\n    if len(nums1) > len(nums2):\n        nums1, nums2 = nums2, nums1\n\n    x, y = len(nums1), len(nums2)\n    low = 0\n    high = x\n\n    while low <= high:\n        partitionX = (low + high) // 2\n        partitionY = (x + y + 1) // 2 - partitionX\n\n        maxLeftX = float('-inf') if partitionX == 0 else nums1[partitionX - 1]\n        minRightX = float('inf') if partitionX == x else nums1[partitionX]\n\n        maxLeftY = float('-inf') if partitionY == 0 else nums2[partitionY - 1]\n        minRightY = float('inf') if partitionY == y else nums2[partitionY]\n\n        if maxLeftX <= minRightY and maxLeftY <= minRightX:\n            if (x + y) % 2 == 0:\n                return (max(maxLeftX, maxLeftY) + min(minRightX, minRightY)) / 2\n            return max(maxLeftX, maxLeftY)\n\n        elif maxLeftX > minRightY:\n            high = partitionX - 1\n        else:\n            low = partitionX + 1\n\n    return 0\n```\nThis algorithm uses a binary search to find the partition point in the smaller array, then uses that partition point to find the partition point in the larger array. This guarantees the optimal O(log(min(m, n))) time complexity.\nPlease note that an algorithm with O(n) time complexity would not be faster for large input sizes. 
It would be significantly slower because it would have to merge the entire arrays before finding the median, whereas the binary search approach only needs to look at a logarithmic number of elements.\nIn summary, the"
            }
        ],
        "origin_prompt": null
    }
]
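
Both answers above stop mid-sentence, which suggests they hit the generation limit. A rough way to spot such cases across a predictions file is sketched below. It is a text heuristic only (an unclosed Markdown code fence, or an ending without terminal punctuation), not something OpenCompass provides, and `looks_truncated` is a hypothetical helper name. Comparing the generated token count against the limit would be more reliable.

```python
FENCE = '`' * 3  # Markdown code-fence marker, built this way to keep the example readable
TERMINAL = ('.', '!', '?', ':', ')', '"', "'")


def looks_truncated(text):
    """Heuristically flag an answer that appears to be cut off (hypothetical helper)."""
    stripped = text.rstrip()
    if not stripped:
        return False
    # An odd number of fence markers means a code block was opened but never closed.
    if stripped.count(FENCE) % 2 == 1:
        return True
    # Ending on a closing fence counts as a clean ending.
    if stripped.endswith(FENCE):
        return False
    # Otherwise, expect the answer to end with terminal punctuation.
    return not stripped.endswith(TERMINAL)


print(looks_truncated("In summary, the"))
print(looks_truncated(FENCE + "python\nmedian = findMedian"))
print(looks_truncated("The space complexity is O(1)."))
```

Applied to the two rounds above, the first would be flagged for the unclosed code fence and the second for ending on "In summary, the". Short answers that legitimately end without punctuation would be false positives.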

Other information

Hi! Thanks for the great work supporting so many evaluation datasets. As LLMs evolve, the MT-Bench output limit of 512 tokens may no longer be suitable: both answers above are cut off mid-sentence. Would changing it to 1024 or 2048 be preferable?

bittersweet1999 commented 4 months ago

Sure, you can set it to 1024 or 2048 here:

https://github.com/open-compass/opencompass/blob/889e7e11409d83fe312ecc7d7f0ed8861a84cc92/configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py#L29

We set 512 because MT-Bench is a multi-turn dataset and we follow the original setting from early MT-Bench.

MrZhengXin commented 4 months ago

Sure, you can set it to 1024 or 2048 here:

https://github.com/open-compass/opencompass/blob/889e7e11409d83fe312ecc7d7f0ed8861a84cc92/configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py#L29

We set 512 because MT-Bench is a multi-turn dataset and we follow the original setting from early MT-Bench.

OK, sounds good. That said, their official length setting was later changed to 1024 as well:

https://github.com/lm-sys/FastChat/blob/92a6d1fcd69a88ea169c0b01065ce44f1e690a2c/fastchat/llm_judge/gen_model_answer.py#L236

bittersweet1999 commented 4 months ago

OK, I will change it later. Thanks!
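
For readers who want to raise the limit locally in the meantime, one option is to override the value in the run config instead of editing the dataset file. The sketch below is a minimal illustration. It assumes each entry of `subjective_datasets` is a nested dict that stores the generation limit at `infer_cfg -> inferencer -> max_out_len` (inferred from the config line linked in this thread, not verified against every OpenCompass version). `set_max_out_len` and the `demo_datasets` stand-in are hypothetical names.

```python
from copy import deepcopy


def set_max_out_len(datasets, max_out_len):
    """Return a copy of the dataset configs with a new generation limit.

    Assumption: each config keeps the limit at
    cfg['infer_cfg']['inferencer']['max_out_len'].
    """
    updated = deepcopy(datasets)
    for cfg in updated:
        inferencer = cfg.get('infer_cfg', {}).get('inferencer')
        # Only touch configs that already define a limit.
        if inferencer is not None and 'max_out_len' in inferencer:
            inferencer['max_out_len'] = max_out_len
    return updated


# Stand-in for one entry of `subjective_datasets`; only the relevant keys are shown.
demo_datasets = [
    dict(
        abbr='mtbench_demo',
        infer_cfg=dict(inferencer=dict(max_seq_len=4096, max_out_len=512)),
    )
]

patched = set_max_out_len(demo_datasets, 1024)
print(patched[0]['infer_cfg']['inferencer']['max_out_len'])
```

In the config from this issue, the equivalent would be a plain loop over `datasets` that sets the same key after `datasets = [*subjective_datasets]`. Whether a helper function can be defined inside the config file itself depends on how the config loader treats it, so the loop form is likely the safer choice.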