open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Result output is 0. #1033

Closed: nanxue2023 closed this issue 6 months ago

nanxue2023 commented 7 months ago

### Prerequisite

### Type

I'm evaluating with the officially supported tasks/models/datasets.

### Environment


```
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
 'GPU 0,1,2,3': 'NVIDIA A40',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.0, V12.0.140',
 'OpenCV': '4.9.0',
 'PyTorch': '2.2.1+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.3.2 (Git Hash '
                              '2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
 'TorchVision': '0.17.1+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.3+322544d',
 'sys.platform': 'linux'}
```

### Reproduces the problem - code/configuration sample

dataset      version  metric    mode      mistral
---------  ---------  --------  ------  --------------
BoolQ         314797  accuracy  ppl                  0

### Reproduces the problem - command or script

```shell
python run.py --models mistral --datasets SuperGLUE_BoolQ_ppl_314797
```

### Reproduces the problem - error message

I downloaded the Mistral model files from the official Mistral website and used OpenCompass to evaluate the model on the BoolQ dataset. Everything went well until the results were printed. The 'BoolQ.json' file in the 'results/' folder looks correct, and I can inspect the labels, predictions, and golds, but the result calculation seems to be wrong. I need some help finding the problem and solving it!

### Other information

I suspect something goes wrong in /opencompass/tasks/openicl_eval.py and in the AccEvaluator class in /opencompass/openicl/icl_evaluator/icl_hf_evaluator.py.

liushz commented 7 months ago

Can you provide some prediction samples here?

nanxue2023 commented 7 months ago

Sure! Here are the prediction samples.


    "0": {
        "in-context examples": "",
        "label: false": {
            "testing input": "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requires a tropical climate to grow productively, returns from 8 to 9 units of energy for each unit expended, as compared to corn, which only returns about 1.34 units of fuel energy for each unit of energy expended. A 2006 University of California Berkeley study, after analyzing six separate studies, concluded that producing ethanol from corn uses much less petroleum than producing gasoline.\nQuestion: Does ethanol take more energy make that produces?\nAnswer: No",
            "prompt": "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requires a tropical climate to grow productively, returns from 8 to 9 units of energy for each unit expended, as compared to corn, which only returns about 1.34 units of fuel energy for each unit of energy expended. A 2006 University of California Berkeley study, after analyzing six separate studies, concluded that producing ethanol from corn uses much less petroleum than producing gasoline.\nQuestion: Does ethanol take more energy make that produces?\nAnswer: No",
            "PPL": 1.8386764526367188,
            "BPB": 0.41776557521925445
        },
        "label: true": {
            "testing input": "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requires a tropical climate to grow productively, returns from 8 to 9 units of energy for each unit expended, as compared to corn, which only returns about 1.34 units of fuel energy for each unit of energy expended. A 2006 University of California Berkeley study, after analyzing six separate studies, concluded that producing ethanol from corn uses much less petroleum than producing gasoline.\nQuestion: Does ethanol take more energy make that produces?\nAnswer: Yes",
            "prompt": "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requires a tropical climate to grow productively, returns from 8 to 9 units of energy for each unit expended, as compared to corn, which only returns about 1.34 units of fuel energy for each unit of energy expended. A 2006 University of California Berkeley study, after analyzing six separate studies, concluded that producing ethanol from corn uses much less petroleum than producing gasoline.\nQuestion: Does ethanol take more energy make that produces?\nAnswer: Yes",
            "PPL": 1.847572698825743,
            "BPB": 0.41949718282516874
        },
        "prediction": "false",
        "gold": false
    },
    "1": {
        "in-context examples": "",
        "label: false": {
            "testing input": "Passage: Property tax or 'house tax' is a local tax on buildings, along with appurtenant land. It is and imposed on the Possessor (not the custodian of property as per 1978, 44th amendment of constitution). It resembles the US-type wealth tax and differs from the excise-type UK rate. The tax power is vested in the states and is delegated to local bodies, specifying the valuation method, rate band, and collection procedures. The tax base is the annual rental value (ARV) or area-based rating. Owner-occupied and other properties not producing rent are assessed on cost and then converted into ARV by applying a percentage of cost, usually four percent. Vacant land is generally exempt. Central government properties are exempt. Instead a 'service charge' is permissible under executive order. Properties of foreign missions also enjoy tax exemption without requiring reciprocity. The tax is usually accompanied by service taxes, e.g., water tax, drainage tax, conservancy (sanitation) tax, lighting tax, all using the same tax base. The rate structure is flat on rural (panchayat) properties, but in the urban (municipal) areas it is mildly progressive with about 80% of assessments falling in the first two brackets.\nQuestion: Is house tax and property tax are same?\nAnswer: No",
            "prompt": "Passage: Property tax or 'house tax' is a local tax on buildings, along with appurtenant land. It is and imposed on the Possessor (not the custodian of property as per 1978, 44th amendment of constitution). It resembles the US-type wealth tax and differs from the excise-type UK rate. The tax power is vested in the states and is delegated to local bodies, specifying the valuation method, rate band, and collection procedures. The tax base is the annual rental value (ARV) or area-based rating. Owner-occupied and other properties not producing rent are assessed on cost and then converted into ARV by applying a percentage of cost, usually four percent. Vacant land is generally exempt. Central government properties are exempt. Instead a 'service charge' is permissible under executive order. Properties of foreign missions also enjoy tax exemption without requiring reciprocity. The tax is usually accompanied by service taxes, e.g., water tax, drainage tax, conservancy (sanitation) tax, lighting tax, all using the same tax base. The rate structure is flat on rural (panchayat) properties, but in the urban (municipal) areas it is mildly progressive with about 80% of assessments falling in the first two brackets.\nQuestion: Is house tax and property tax are same?\nAnswer: No",
            "PPL": 1.0577706585359727,
            "BPB": 0.2576303243272627
        },
        "label: true": {
            "testing input": "Passage: Property tax or 'house tax' is a local tax on buildings, along with appurtenant land. It is and imposed on the Possessor (not the custodian of property as per 1978, 44th amendment of constitution). It resembles the US-type wealth tax and differs from the excise-type UK rate. The tax power is vested in the states and is delegated to local bodies, specifying the valuation method, rate band, and collection procedures. The tax base is the annual rental value (ARV) or area-based rating. Owner-occupied and other properties not producing rent are assessed on cost and then converted into ARV by applying a percentage of cost, usually four percent. Vacant land is generally exempt. Central government properties are exempt. Instead a 'service charge' is permissible under executive order. Properties of foreign missions also enjoy tax exemption without requiring reciprocity. The tax is usually accompanied by service taxes, e.g., water tax, drainage tax, conservancy (sanitation) tax, lighting tax, all using the same tax base. The rate structure is flat on rural (panchayat) properties, but in the urban (municipal) areas it is mildly progressive with about 80% of assessments falling in the first two brackets.\nQuestion: Is house tax and property tax are same?\nAnswer: Yes",
            "prompt": "Passage: Property tax or 'house tax' is a local tax on buildings, along with appurtenant land. It is and imposed on the Possessor (not the custodian of property as per 1978, 44th amendment of constitution). It resembles the US-type wealth tax and differs from the excise-type UK rate. The tax power is vested in the states and is delegated to local bodies, specifying the valuation method, rate band, and collection procedures. The tax base is the annual rental value (ARV) or area-based rating. Owner-occupied and other properties not producing rent are assessed on cost and then converted into ARV by applying a percentage of cost, usually four percent. Vacant land is generally exempt. Central government properties are exempt. Instead a 'service charge' is permissible under executive order. Properties of foreign missions also enjoy tax exemption without requiring reciprocity. The tax is usually accompanied by service taxes, e.g., water tax, drainage tax, conservancy (sanitation) tax, lighting tax, all using the same tax base. The rate structure is flat on rural (panchayat) properties, but in the urban (municipal) areas it is mildly progressive with about 80% of assessments falling in the first two brackets.\nQuestion: Is house tax and property tax are same?\nAnswer: Yes",
            "PPL": 1.0422258285080888,
            "BPB": 0.2536462234746675
        },
        "prediction": "true",
        "gold": true
    },
    "2": {
        "in-context examples": "",
        "label: false": {
            "testing input": "Passage: Phantom pain sensations are described as perceptions that an individual experiences relating to a limb or an organ that is not physically part of the body. Limb loss is a result of either removal by amputation or congenital limb deficiency. However, phantom limb sensations can also occur following nerve avulsion or spinal cord injury.\nQuestion: Is pain experienced in a missing body part or paralyzed area?\nAnswer: No",
            "prompt": "Passage: Phantom pain sensations are described as perceptions that an individual experiences relating to a limb or an organ that is not physically part of the body. Limb loss is a result of either removal by amputation or congenital limb deficiency. However, phantom limb sensations can also occur following nerve avulsion or spinal cord injury.\nQuestion: Is pain experienced in a missing body part or paralyzed area?\nAnswer: No",
            "PPL": 1.6369797949697458,
            "BPB": 0.3939460721539342
        },
        "label: true": {
            "testing input": "Passage: Phantom pain sensations are described as perceptions that an individual experiences relating to a limb or an organ that is not physically part of the body. Limb loss is a result of either removal by amputation or congenital limb deficiency. However, phantom limb sensations can also occur following nerve avulsion or spinal cord injury.\nQuestion: Is pain experienced in a missing body part or paralyzed area?\nAnswer: Yes",
            "prompt": "Passage: Phantom pain sensations are described as perceptions that an individual experiences relating to a limb or an organ that is not physically part of the body. Limb loss is a result of either removal by amputation or congenital limb deficiency. However, phantom limb sensations can also occur following nerve avulsion or spinal cord injury.\nQuestion: Is pain experienced in a missing body part or paralyzed area?\nAnswer: Yes",
            "PPL": 1.6018272848690258,
            "BPB": 0.38458790289396194
        },
        "prediction": "true",
        "gold": true
    },
    "3": {
        "in-context examples": "",
        "label: false": {
            "testing input": "Passage: Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida, a theme park located within the Universal Orlando Resort. Similar to dark rides, the roller coaster utilizes special effects in a controlled-lighting environment and also employs motion-based 3-D projection of both animation and live-action sequences to enhance the experience. The ride, which is themed to the Gringotts Wizarding Bank, became the flagship attraction for the expanded Wizarding World of Harry Potter when it opened on July 8, 2014.\nQuestion: Is harry potter and the escape from gringotts a roller coaster ride?\nAnswer: No",
            "prompt": "Passage: Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida, a theme park located within the Universal Orlando Resort. Similar to dark rides, the roller coaster utilizes special effects in a controlled-lighting environment and also employs motion-based 3-D projection of both animation and live-action sequences to enhance the experience. The ride, which is themed to the Gringotts Wizarding Bank, became the flagship attraction for the expanded Wizarding World of Harry Potter when it opened on July 8, 2014.\nQuestion: Is harry potter and the escape from gringotts a roller coaster ride?\nAnswer: No",
            "PPL": 1.4060681892163827,
            "BPB": 0.35689192570324085
        },
        "label: true": {
            "testing input": "Passage: Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida, a theme park located within the Universal Orlando Resort. Similar to dark rides, the roller coaster utilizes special effects in a controlled-lighting environment and also employs motion-based 3-D projection of both animation and live-action sequences to enhance the experience. The ride, which is themed to the Gringotts Wizarding Bank, became the flagship attraction for the expanded Wizarding World of Harry Potter when it opened on July 8, 2014.\nQuestion: Is harry potter and the escape from gringotts a roller coaster ride?\nAnswer: Yes",
            "prompt": "Passage: Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida, a theme park located within the Universal Orlando Resort. Similar to dark rides, the roller coaster utilizes special effects in a controlled-lighting environment and also employs motion-based 3-D projection of both animation and live-action sequences to enhance the experience. The ride, which is themed to the Gringotts Wizarding Bank, became the flagship attraction for the expanded Wizarding World of Harry Potter when it opened on July 8, 2014.\nQuestion: Is harry potter and the escape from gringotts a roller coaster ride?\nAnswer: Yes",
            "PPL": 1.3540132464784564,
            "BPB": 0.3431545021609523
        },
        "prediction": "true",
        "gold": true
    },
    "4": {
        "in-context examples": "",
        "label: false": {
            "testing input": "Passage: Hydroxyzine preparations require a doctor's prescription. The drug is available in two formulations, the pamoate and the dihydrochloride or hydrochloride salts. Vistaril, Equipose, Masmoran, and Paxistil are preparations of the pamoate salt, while Atarax, Alamon, Aterax, Durrax, Tran-Q, Orgatrax, Quiess, and Tranquizine are of the hydrochloride salt.\nQuestion: Is there a difference between hydroxyzine hcl and hydroxyzine pam?\nAnswer: No",
            "prompt": "Passage: Hydroxyzine preparations require a doctor's prescription. The drug is available in two formulations, the pamoate and the dihydrochloride or hydrochloride salts. Vistaril, Equipose, Masmoran, and Paxistil are preparations of the pamoate salt, while Atarax, Alamon, Aterax, Durrax, Tran-Q, Orgatrax, Quiess, and Tranquizine are of the hydrochloride salt.\nQuestion: Is there a difference between hydroxyzine hcl and hydroxyzine pam?\nAnswer: No",
            "PPL": 1.7489594558189656,
            "BPB": 0.5687039655892405
        },
        "label: true": {
            "testing input": "Passage: Hydroxyzine preparations require a doctor's prescription. The drug is available in two formulations, the pamoate and the dihydrochloride or hydrochloride salts. Vistaril, Equipose, Masmoran, and Paxistil are preparations of the pamoate salt, while Atarax, Alamon, Aterax, Durrax, Tran-Q, Orgatrax, Quiess, and Tranquizine are of the hydrochloride salt.\nQuestion: Is there a difference between hydroxyzine hcl and hydroxyzine pam?\nAnswer: Yes",
            "prompt": "Passage: Hydroxyzine preparations require a doctor's prescription. The drug is available in two formulations, the pamoate and the dihydrochloride or hydrochloride salts. Vistaril, Equipose, Masmoran, and Paxistil are preparations of the pamoate salt, while Atarax, Alamon, Aterax, Durrax, Tran-Q, Orgatrax, Quiess, and Tranquizine are of the hydrochloride salt.\nQuestion: Is there a difference between hydroxyzine hcl and hydroxyzine pam?\nAnswer: Yes",
            "PPL": 1.7053124789533944,
            "BPB": 0.5532791598382124
        },
        "prediction": "true",
        "gold": true
liushz commented 6 months ago

There seems to be an issue with the BoolQ dataset configuration. We will promptly release a fixed version to address this. We will notify you in this Issue as soon as the fix is implemented. Sorry for the inconvenience.

nanxue2023 commented 6 months ago

Okay, thanks! I look forward to your notification.

Leymore commented 6 months ago

Hello, @nanxue2023. I failed to reproduce the result with the latest main branch. The error clearly lies in the result file: the prediction field is of type str, while the gold field is of type bool. I guess you used the dataset from Hugging Face instead of ./data/SuperGLUE/BoolQ/val.jsonl. You can change the code in opencompass/datasets/boolq.py to make the answer field a str, delete the result file, and run the evaluation again.
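
To illustrate why a str prediction compared against a bool gold yields a score of 0, here is a minimal sketch. It assumes the evaluator compares predictions and golds with plain equality, which is an assumption made for illustration; `accuracy` and `normalize_label` are hypothetical helpers and not OpenCompass functions.

```python
def accuracy(predictions, references):
    """Percentage of positions where prediction == reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100 * correct / len(references)


def normalize_label(value):
    """Hypothetical helper: map a bool or str label to 'true' / 'false'."""
    return str(value).lower()


# Values copied from the five prediction samples posted earlier in this thread.
predictions = ["false", "true", "true", "true", "true"]
golds = [False, True, True, True, True]

# "true" == True is False in Python, so no pair ever matches.
raw_score = accuracy(predictions, golds)

# Once the golds are converted to str, every pair matches.
fixed_score = accuracy(predictions, [normalize_label(g) for g in golds])

print(raw_score, fixed_score)  # 0.0 100.0
```

Converting the gold labels to str at dataset-loading time, as suggested above, has the same effect as the `normalize_label` step in this sketch.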

nanxue2023 commented 6 months ago

@Leymore Thanks. I checked the result file and the code once again. My dataset is from https://super.gluebenchmark.com/tasks, and I didn't modify the code in opencompass/datasets/boolq.py. But I'm sure the problem is the field type difference, as you said. I don't know how to fix it.
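
One way to confirm the field type difference is to scan the result file for entries whose prediction and gold types differ. The sketch below is only an illustration: `find_type_mismatches` is a hypothetical helper, and the embedded JSON is a trimmed stand-in that keeps just the "prediction" and "gold" fields from the samples posted above.

```python
import json

# Trimmed stand-in for the result file; the real file has more fields per entry.
sample = """
{
  "0": {"prediction": "false", "gold": false},
  "1": {"prediction": "true", "gold": true}
}
"""


def find_type_mismatches(records):
    """Return (key, prediction type, gold type) for entries whose types differ."""
    mismatches = []
    for key, entry in records.items():
        pred_type = type(entry["prediction"]).__name__
        gold_type = type(entry["gold"]).__name__
        if pred_type != gold_type:
            mismatches.append((key, pred_type, gold_type))
    return mismatches


mismatches = find_type_mismatches(json.loads(sample))
print(mismatches)  # [('0', 'str', 'bool'), ('1', 'str', 'bool')]
```

An empty list would mean the two fields already share a type, in which case the zero score would have to come from somewhere else.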

nanxue2023 commented 6 months ago

I solved the problem by downloading the datasets as described in the documentation, and found that this dataset is slightly different from the one at https://super.gluebenchmark.com/tasks. Thanks!