open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

GPU out of memory? #182

Closed lunalulu closed 1 year ago

lunalulu commented 1 year ago

Describe the bug

When I evaluate a model, why does OpenCompass create so many tasks, each of which loads the model again, until the GPU runs out of memory?

Environment information

{'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda-11.7', 'GCC': 'gcc (GCC) 10.2.0', 'GPU 0,1,2,3': 'NVIDIA A100 80GB PCIe', 'MMEngine': '0.8.4', 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.64', 'OpenCV': '4.8.0', 'PyTorch': '2.0.1', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2021.4-Product Build 20210904 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v2.7.3 (Git Hash ' '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX2\n' ' - CUDA Runtime 11.8\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n' ' - CuDNN 8.7\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, CUDA_VERSION=11.8, ' 'CUDNN_VERSION=8.7.0, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -Wno-deprecated ' '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL ' '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER ' '-DUSE_FBGEMM -DUSE_QNNPACK ' '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK ' '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC ' '-Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wunused-local-typedefs ' '-Wno-unused-parameter -Wno-unused-function ' '-Wno-unused-result -Wno-strict-overflow ' '-Wno-strict-aliasing ' '-Wno-error=deprecated-declarations ' '-Wno-stringop-overflow -Wno-psabi ' '-Wno-error=pedantic -Wno-error=redundant-decls ' '-Wno-error=old-style-cast ' '-fdiagnostics-color=always -faligned-new ' '-Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Werror=cast-function-type ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, ' 'TORCH_DISABLE_GPU_ASSERTS=ON, ' 'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, \n', 'Python': '3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) ' '[GCC 12.3.0]', 'TorchVision': '0.15.2', 'numpy_random_seed': 2147483648, 'opencompass': '0.1.0+', 'sys.platform': 'linux'}

Other information

No response

Ezra-Yu commented 1 year ago

Which model are you testing, and on which GPU? Ideally, please paste the model's config.

lunalulu commented 1 year ago

The model is baichuan-inc/Baichuan-13B-Chat and the GPU is an A100.

Ezra-Yu commented 1 year ago

You can set `--max-partition-size` and `--gen-task-coef` to reduce the number of partitions:

```
python run.py YOUR_CONFIG --max-partition-size 10000 --gen-task-coef 10
```
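For context, the same knobs can also live in the config file instead of on the CLI. A minimal sketch, assuming these flags map to the `SizePartitioner` arguments `max_task_size` and `gen_task_coef` in this OpenCompass version:

```python
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# A larger max_task_size yields fewer, larger inference tasks, so the
# model is loaded fewer times; gen_task_coef weights generation-style
# tasks when their size is estimated.
infer = dict(
    partitioner=dict(type=SizePartitioner,
                     max_task_size=10000,
                     gen_task_coef=10),
    runner=dict(type=LocalRunner,
                task=dict(type=OpenICLInferTask)),
)
```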
lunalulu commented 1 year ago

OK, I'll give it a try 🙏

Ezra-Yu commented 1 year ago

In principle, the GPU memory is released after each task and should not stay occupied, so this kind of bug shouldn't occur. Could you share your config file so we can debug it? Also, are you using the 80G or the 40G variant?

lunalulu commented 1 year ago

I'm using the 80G A100. As for the config, I only changed the following:

1. datasets: only C-Eval is selected:

```python
from mmengine.config import read_base

with read_base():
    from ..ceval.ceval_gen_5f30c7 import ceval_datasets
    from ..SuperGLUE_AX_b.SuperGLUE_AX_b_gen_4dfefa import AX_b_datasets
    # from ..SuperGLUE_AX_g.SuperGLUE_AX_g_gen_68aac7 import AX_g_datasets
    # from ..SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_883d50 import BoolQ_datasets
    # from ..SuperGLUE_CB.SuperGLUE_CB_gen_854c6c import CB_datasets
    # from ..SuperGLUE_COPA.SuperGLUE_COPA_gen_91ca53 import COPA_datasets
    # from ..SuperGLUE_MultiRC.SuperGLUE_MultiRC_gen_27071f import MultiRC_datasets
    # from ..SuperGLUE_RTE.SuperGLUE_RTE_gen_68aac7 import RTE_datasets
    # from ..SuperGLUE_ReCoRD.SuperGLUE_ReCoRD_gen_30dea0 import ReCoRD_datasets
    # from ..SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
    # from ..SuperGLUE_WSC.SuperGLUE_WSC_gen_8a881c import WSC_datasets

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
```

2. Model loading: I changed it to a local model path. [screenshot of the model entry omitted]
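For reference, a model entry pointing at a local checkout would look roughly like the following. This is a sketch: the directory path is hypothetical, and the remaining fields mirror the working config Ezra-Yu posts below.

```python
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='baichuan-13b-chat-hf',
        # Hypothetical local directory holding the downloaded weights.
        path='/path/to/Baichuan-13B-Chat',
        tokenizer_path='/path/to/Baichuan-13B-Chat',
        tokenizer_kwargs=dict(padding_side='left',
                              truncation_side='left',
                              trust_remote_code=True,
                              use_fast=False),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```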

yangkexin commented 1 year ago

The same problem occurs when evaluating baichuan-13b-base on MMLU with an 80G A100. Is there a fix yet?

Ezra-Yu commented 1 year ago

@yangkexin @lunalulu

Setting the batch size to 4 makes it run. The exact test config is below; I've verified it works on my end.

```python
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.ceval.ceval_gen import ceval_datasets

datasets = [*ceval_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='baichuan-13b-chat-hf',
        path="baichuan-inc/Baichuan-13B-Chat",
        tokenizer_path='baichuan-inc/Baichuan-13B-Chat',
        tokenizer_kwargs=dict(padding_side='left',
                              truncation_side='left',
                              trust_remote_code=True,
                              use_fast=False),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=4,
        model_kwargs=dict(device_map='auto', trust_remote_code=True, revision='75cc8a7e5220715ebccb771581e6ca8c1377cf71'),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```
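To use it, save the snippet as a config file and pass it to `run.py`, e.g. `python run.py configs/eval_baichuan_13b_chat.py` (the filename is hypothetical); the `--max-partition-size` and `--gen-task-coef` flags from above can still be added on top.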
lunalulu commented 1 year ago

I re-pulled the code yesterday and it seems to run normally now. However, my evaluation of the original Baichuan-13B-Chat model doesn't seem to match the official results: I get an average of 50.04 on the C-Eval val split, while the official figure is 51.5. Could the testing methodology be different? @Ezra-Yu

lunalulu commented 1 year ago

For chatglm2-6B, I get an average of 51.35 on the C-Eval val split.

Ezra-Yu commented 1 year ago

> However, my evaluation of the original Baichuan-13B-Chat model doesn't seem to match the official results: I get an average of 50.04 on the C-Eval val split, while the official figure is 51.5. Could the testing methodology be different?

That's a separate issue; please open a new one for it.

lunalulu commented 1 year ago

Sure~

yangkexin commented 1 year ago

MMLU doesn't run for me. Even after switching to batch=1 and max-length 2048 it still fails, blowing up (OOM) once it reaches lukaemon_mmlu_high_school_psychology_1.

yangkexin commented 1 year ago

> MMLU doesn't run for me. Even after switching to batch=1 and max-length 2048 it still fails, blowing up (OOM) once it reaches lukaemon_mmlu_high_school_psychology_1.

Following the suggestion above, I changed to `--max-partition-size 1000 --gen-task-coef 10` and am trying it now.

Ezra-Yu commented 1 year ago

> MMLU doesn't run for me. Even after switching to batch=1 and max-length 2048 it still fails, blowing up (OOM) once it reaches lukaemon_mmlu_high_school_psychology_1.

I did a test run of baichuan13B-chat on MMLU: with batch size 4 it completes without problems in my environment.