open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Getting 0 accuracy for Llama3-8B and Qwen2-7B models #1412

Closed · sriyachakravarthy closed this 2 months ago

sriyachakravarthy commented 2 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': False, 'GCC': 'gcc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0', 'MMEngine': '0.10.4', 'MUSA available': False, 'OpenCV': '4.10.0', 'PyTorch': '2.4.0+rocm6.1', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2022.2-Product Build 20220804 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.4.2 (Git Hash ' '1137e04ec0b5251ca2b4400a4fd3c667ce843d67)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX512\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOCUPTI -DUSE_FBGEMM ' '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK ' '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC ' '-Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wsuggest-override ' '-Wno-psabi -Wno-error=pedantic ' '-Wno-error=old-style-cast -Wno-missing-braces ' '-fdiagnostics-color=always -faligned-new ' '-Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, ' 'USE_CUDA=OFF, USE_CUDNN=OFF, ' 'USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, ' 'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, ' 'USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, ' 'USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, ' 'USE_ROCM=ON, USE_ROCM_KERNEL_ASSERT=OFF, \n', 'Python': '3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]', 'TorchVision': '0.19.0+rocm6.1', 'lmdeploy': "not installed:No module named 'lmdeploy'", 'numpy_random_seed': 2147483648, 'opencompass': '0.3.0+88eb912', 'sys.platform': 'linux', 'transformers': '4.44.0'}

Reproduces the problem - code/configuration sample

CUDA_VISIBLE_DEVICES=0 python -u run.py --datasets commonsenseqa_gen --hf-num-gpus 1 --hf-type base --hf-path meta-llama/Meta-Llama-3-8B --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 1

Reproduces the problem - command or script

outputs/default/20240809_090910/results/Meta-Llama-3-8B_hf/commonsense_qa.json

dataset           version  metric    mode      Meta-Llama-3-8B_hf
--------------  ---------  --------  ------  --------------------
commonsense_qa     c946f2  accuracy  gen                     0.00


Other information

No response

sriyachakravarthy commented 2 months ago

Also, I'm getting the following error for TruthfulQA:

AssertionError: truth_model should be set to perform API eval. If you want to perform basic metric eval, please refer to the docstring of /rhome/sriyar/Sriya/opencompass/opencompass/datasets/truthfulqa.py for more details.

Inference Time/Execution Time: 3044 seconds
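For the TruthfulQA assertion, the message points at the docstring in opencompass/datasets/truthfulqa.py. Below is a hypothetical sketch of switching the dataset's eval_cfg to the basic, non-API metrics; the TruthfulQAEvaluator class name and the metrics/truth_model arguments are inferred from that error message and docstring reference, not verified against this OpenCompass version:

```python
# Hypothetical override (class/argument names inferred from the error message and
# the truthfulqa.py docstring it references; verify against your OpenCompass version).
from opencompass.datasets import TruthfulQAEvaluator

truthfulqa_eval_cfg = dict(
    evaluator=dict(
        type=TruthfulQAEvaluator,
        # Basic, offline metrics; the API-judged metrics ('truth', 'info') would
        # additionally require truth_model / info_model to be set.
        metrics=('bleu', 'rouge'),
    ),
)
```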

tonysy commented 2 months ago

For base models, we recommend using perplexity (ppl) evaluation for multiple-choice questions:

python -u run.py --datasets commonsenseqa_ppl --hf-num-gpus 1 --hf-type base --hf-path meta-llama/Meta-Llama-3-8B --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 8
dataset           version  metric    mode      Meta-Llama-3-8B_hf
--------------  ---------  --------  ------  --------------------
commonsense_qa  554500.00  accuracy  ppl                    70.19
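
For context, a minimal sketch (plain transformers, not OpenCompass's internals) of what ppl-mode scoring of a multiple-choice item amounts to: each candidate answer is ranked by its average per-token loss given the question, and the lowest-loss option is taken as the prediction. The question and options below are just an illustrative CommonsenseQA-style example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"   # gated model; assumes local access
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "Where would you expect to find a pizzeria while shopping?\nAnswer:"
options = [" chicago", " street", " little italy", " food court", " capital cities"]

@torch.no_grad()
def avg_nll(prompt: str, continuation: str) -> float:
    """Average per-token negative log-likelihood of `continuation` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :prompt_len] = -100           # score only the continuation tokens
    out = model(full.to(model.device), labels=labels.to(model.device))
    return out.loss.item()                  # lower loss == lower perplexity

scores = {opt: avg_nll(question, opt) for opt in options}
print(min(scores, key=scores.get))          # option the model finds most likely
```

This is presumably why gen mode scores 0 here: in gen mode the base model's free-form completion has to be parsed back into an option label, which untuned base models often fail at.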
tonysy commented 2 months ago

Feel free to re-open if needed.

ppalantir commented 2 months ago

For base models, we recommend using perplexity (ppl) evaluation for multiple-choice questions:

python -u run.py --datasets commonsenseqa_ppl --hf-num-gpus 1 --hf-type base --hf-path meta-llama/Meta-Llama-3-8B --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 8
dataset           version  metric    mode      Meta-Llama-3-8B_hf
--------------  ---------  --------  ------  --------------------
commonsense_qa  554500.00  accuracy  ppl                    70.19

hi, could you please help to check why the results are blank?

(opencompass) [~/project/LLM/opencompass]$ python -u run.py --datasets commonsenseqa_ppl --hf-num-gpus 1 --hf-type base --hf-path meta-llama/Meta-Llama-3-8B --debug --model-kwargs device_map='auto' trust_remote_code=True --batch-size 8
/home/xxxx/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
  warnings.warn(
08/21 21:05:26 - OpenCompass - INFO - Loading commonsenseqa_ppl: configs/datasets/commonsenseqa/commonsenseqa_ppl.py
08/21 21:05:26 - OpenCompass - DEBUG - Using model: {'type': 'opencompass.models.huggingface_above_v4_33.HuggingFaceBaseModel', 'abbr': 'Meta-Llama-3-8B_hf', 'path': 'meta-llama/Meta-Llama-3-8B', 'model_kwargs': {'device_map': 'auto', 'trust_remote_code': True}, 'tokenizer_path': None, 'tokenizer_kwargs': {}, 'generation_kwargs': {}, 'peft_path': None, 'peft_kwargs': {}, 'max_seq_len': None, 'max_out_len': 256, 'batch_size': 8, 'pad_token_id': None, 'stop_words': [], 'run_cfg': {'num_gpus': 1}}
08/21 21:05:26 - OpenCompass - INFO - Loading example: configs/summarizers/example.py
08/21 21:05:26 - OpenCompass - INFO - Current exp folder: outputs/default/20240821_210526
08/21 21:05:26 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
08/21 21:05:26 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
08/21 21:05:26 - OpenCompass - DEBUG - Get class NumWorkerPartitioner from "partitioner" registry in "opencompass"
08/21 21:05:26 - OpenCompass - DEBUG - An NumWorkerPartitioner instance is built from registry, and its implementation can be found in opencompass.partitioners.num_worker
08/21 21:05:26 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
08/21 21:05:26 - OpenCompass - DEBUG - Key eval.runner.task.dump_details not found in config, ignored.
08/21 21:05:26 - OpenCompass - DEBUG - Key eval.given_pred not found in config, ignored.
08/21 21:05:26 - OpenCompass - DEBUG - Key eval.runner.task.cal_extract_rate not found in config, ignored.
08/21 21:05:26 - OpenCompass - DEBUG - Additional config: {}
08/21 21:05:26 - OpenCompass - INFO - Partitioned into 1 tasks.
08/21 21:05:26 - OpenCompass - DEBUG - Task 0: [Meta-Llama-3-8B_hf/commonsense_qa]
08/21 21:05:26 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
08/21 21:05:26 - OpenCompass - DEBUG - Get class LocalRunner from "runner" registry in "opencompass"
08/21 21:05:26 - OpenCompass - DEBUG - An LocalRunner instance is built from registry, and its implementation can be found in opencompass.runners.local
08/21 21:05:26 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
08/21 21:05:26 - OpenCompass - DEBUG - Get class OpenICLInferTask from "task" registry in "opencompass"
08/21 21:05:26 - OpenCompass - DEBUG - An OpenICLInferTask instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_infer
08/21 21:05:27 - OpenCompass - WARNING - Only use 1 GPUs for total 2 available GPUs in debug mode.
08/21 21:05:27 - OpenCompass - DEBUG - Debug mode, log will be saved to tmp/3412580_debug.log
08/21 21:05:33 - OpenCompass - DEBUG - Get class NaivePartitioner from "partitioner" registry in "opencompass"
08/21 21:05:33 - OpenCompass - DEBUG - An NaivePartitioner instance is built from registry, and its implementation can be found in opencompass.partitioners.naive
08/21 21:05:33 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
08/21 21:05:33 - OpenCompass - DEBUG - Key eval.runner.task.dump_details not found in config, ignored.
08/21 21:05:33 - OpenCompass - DEBUG - Key eval.given_pred not found in config, ignored.
08/21 21:05:33 - OpenCompass - DEBUG - Key eval.runner.task.cal_extract_rate not found in config, ignored.
08/21 21:05:33 - OpenCompass - DEBUG - Additional config: {'eval': {'runner': {'task': {}}}}
08/21 21:05:33 - OpenCompass - INFO - Partitioned into 1 tasks.
08/21 21:05:33 - OpenCompass - DEBUG - Task 0: [Meta-Llama-3-8B_hf/commonsense_qa]
08/21 21:05:33 - OpenCompass - DEBUG - Get class LocalRunner from "runner" registry in "opencompass"
08/21 21:05:33 - OpenCompass - DEBUG - An LocalRunner instance is built from registry, and its implementation can be found in opencompass.runners.local
08/21 21:05:33 - OpenCompass - DEBUG - Get class OpenICLEvalTask from "task" registry in "opencompass"
08/21 21:05:33 - OpenCompass - DEBUG - An OpenICLEvalTask instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_eval
08/21 21:05:34 - OpenCompass - DEBUG - Modules of opencompass's load_dataset registry have been automatically imported from opencompass.datasets
08/21 21:05:34 - OpenCompass - DEBUG - Get class commonsenseqaDataset from "load_dataset" registry in "opencompass"
08/21 21:05:34 - OpenCompass - DEBUG - An commonsenseqaDataset instance is built from registry, and its implementation can be found in opencompass.datasets.commonsenseqa
08/21 21:05:34 - OpenCompass - ERROR - /home/xxxx/project/LLM/opencompass/opencompass/tasks/openicl_eval.py - _score - 253 - Task [Meta-Llama-3-8B_hf/commonsense_qa]: No predictions found.
08/21 21:05:34 - OpenCompass - DEBUG - An DefaultSummarizer instance is built from registry, and its implementation can be found in opencompass.summarizers.default

dataset         version  metric  mode    Meta-Llama-3-8B_hf
--------------  -------  ------  ------  --------------------
commonsense_qa  -        -       -       -

08/21 21:05:34 - OpenCompass - INFO - write summary to /home/xxxx/project/LLM/opencompass/outputs/default/20240821_210526/summary/summary_20240821_210526.txt
08/21 21:05:34 - OpenCompass - INFO - write csv to /home/xxxx/project/LLM/opencompass/outputs/default/20240821_210526/summary/summary_20240821_210526.csv
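
To dig into the blank summary: the eval task logs "No predictions found", so the inference stage is worth checking first. Below is a quick sketch; the paths come from the log above, and the predictions/ subfolder layout is an assumption about OpenCompass's output directory structure rather than something confirmed in this thread.

```python
# Hypothetical quick check (paths taken from the log above; the predictions/
# layout under the output folder is assumed, not verified).
import glob
from pathlib import Path

out_dir = Path("outputs/default/20240821_210526")

# Tail of the debug-mode inference log mentioned in the output above.
print(Path("tmp/3412580_debug.log").read_text()[-2000:])

# If inference produced anything, per-dataset prediction files should show up here.
print(glob.glob(str(out_dir / "predictions" / "Meta-Llama-3-8B_hf" / "*")))
```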