open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Prompt with trailing whitespace may hurt model performance #928

Open yzlnew opened 8 months ago

yzlnew commented 8 months ago

Prerequisite

Type

I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (GCC) 9.2.1 20200522 (Alibaba 9.2.1-3 2.17)',
 'GPU 0,1,2,3': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.9.0',
 'PyTorch': '2.1.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.1.1 (Git Hash '
                              '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.3\n'
                              '    - Built with CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=old-style-cast '
                              '-Wno-invalid-partial-specialization '
                              '-Wno-unused-private-field '
                              '-Wno-aligned-allocation-unavailable '
                              '-Wno-missing-braces -fdiagnostics-color=always '
                              '-faligned-new -Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]',
 'TorchVision': '0.16.0+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.1+',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

Evaluating my own model.

Reproduces the problem - command or script

python run.py --datasets agieval_gen \
--models $MY_MODEL \
--model-kwargs device_map='auto' \
--tokenizer-path $TOKENIZER_PATH \
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
--max-out-len $MAX_OUT_LEN \
--max-seq-len 2048 \
--batch-size 8 \
--no-batch-padding \
--work-dir $WORK_DIR

Reproduces the problem - error message

None

Other information

I'm evaluating on AGIEval and noticed a performance drop under the default config. Digging into the predictions, I found that the model generates unusual tokens, such as multiple whitespace characters or "\n".

https://github.com/open-compass/opencompass/blob/ba7cd58da3317bdec233d097153e2ab92c5f5dd5/configs/datasets/agieval/agieval_gen_64afd3.py#L72

The issue is gone when I remove the trailing whitespace. It looks like an out-of-distribution (OOD) problem: a base model is asked to predict in a situation it never saw during pre-training, which is also mentioned in this video. Going back to the original AGIEval repo, there are no trailing whitespaces.
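To illustrate why a trailing space is OOD for a GPT-style BPE tokenizer, here is a toy sketch (not a real tokenizer): such vocabularies attach the leading space to the *following* word, so a prompt ending in a space splits off a lone space token and then asks the model for a bare, space-less continuation it rarely saw in pre-training.

```python
def toy_tokenize(text):
    """Split text into pieces that keep their leading space, mimicking the
    ' word' convention of GPT-style BPE vocabularies (toy approximation)."""
    tokens, current = [], ""
    for ch in text:
        if ch == " ":
            if current:
                tokens.append(current)
            current = " "  # start a new token with the space attached
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(toy_tokenize("The answer is"))   # ['The', ' answer', ' is']
print(toy_tokenize("The answer is "))  # ['The', ' answer', ' is', ' '] — dangling space token
```

With the clean prompt, the model continues with a familiar space-prefixed token like ' A'; with the trailing space, it must produce a bare 'A' after a lone ' ', which is the unusual situation described above.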

tonysy commented 8 months ago

You are right. LLMs are sensitive to the prompt.

yzlnew commented 8 months ago

@tonysy Is this considered a bug in OpenCompass, and will it be fixed in a future release? I've noticed several other datasets with prompts configured similarly, which could cause a performance downgrade as well.

tonysy commented 8 months ago

I don't think it is a bug; it's an issue with the LLM rather than with the evaluation. Actually, we may need to introduce several different prompts to improve the robustness of the evaluation.
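The multi-prompt idea could look roughly like this (hypothetical helper names, not the OpenCompass implementation): score each question under several prompt variants and report the mean, so one unlucky template, e.g. one with a trailing space, cannot dominate the result.

```python
# Hypothetical prompt variants for one question; real evaluation templates
# would come from the dataset configs.
PROMPT_VARIANTS = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "{q}\nThe answer is",
]

def evaluate_multi_prompt(score_fn, question):
    """Average a per-prompt scoring function over all prompt variants."""
    scores = [score_fn(tpl.format(q=question)) for tpl in PROMPT_VARIANTS]
    return sum(scores) / len(scores)

# Toy scorer that penalises trailing whitespace, mimicking the reported drop.
demo = evaluate_multi_prompt(lambda p: 0.0 if p.endswith(" ") else 1.0, "2+2=?")
print(demo)  # 1.0 — none of these variants end in whitespace
```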

yzlnew commented 8 months ago

@tonysy I agree with this view. However, I want to point out that OpenCompass can give different results compared to the original benchmarks, and the prompts vary across datasets, e.g. some have a trailing whitespace and some do not.

That said, the issue can be partially fixed at the tokenization stage. As a result, a model whose tokenizer additionally handles trailing whitespace gets higher scores on the leaderboard, even though that does not reflect the model's true capability.
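A tokenization-stage workaround might be sketched as follows (function names are hypothetical, not an OpenCompass API): strip trailing whitespace before encoding, so the model can emit the familiar space-prefixed token, and restore the separator when re-joining if the model happens to omit it.

```python
def prepare_prompt(prompt):
    """Return the prompt with trailing spaces/tabs removed, plus a flag
    telling the caller whether anything was stripped."""
    stripped = prompt.rstrip(" \t")
    return stripped, stripped != prompt

def join_generation(prompt, completion, was_stripped):
    """Re-join prompt and completion so no separator is lost: if the prompt
    was stripped but the model produced a bare token, re-insert the space."""
    if was_stripped and not completion.startswith(" "):
        completion = " " + completion
    return prompt + completion

p, flag = prepare_prompt("The answer is ")
print(join_generation(p, " A", flag))  # 'The answer is A'
```

This only papers over the prompt-side symptom; the underlying sensitivity of the model remains.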

tonysy commented 8 months ago

Right, we are working on prompt sensitivity and will provide multi-prompt results soon. Stay tuned.

longxudou commented 5 months ago

@yzlnew It's a problem related to BPE dropout. Our paper discusses this problem: https://arxiv.org/pdf/2404.03608

yzlnew commented 5 months ago

@longxudou Thanks. It looks like a simple but effective fix at the tokenization stage.