open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] All models hit OOM on tag 0.1.2, but reverting to commit 3715be6 works #249

Closed simonjoe246 closed 1 year ago

simonjoe246 commented 1 year ago

Prerequisites

Type of issue

I am evaluating with officially supported tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0,1,2,3': 'Tesla P100-SXM2-16GB',
 'MMEngine': '0.8.4',
 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
 'OpenCV': '4.8.0',
 'PyTorch': '2.0.1+cu118',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '
                              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.8\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.7\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
                              'CUDNN_VERSION=8.7.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
                              '-DUSE_FBGEMM -DUSE_QNNPACK '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.10 (main, Aug 17 2023, 19:49:57) [GCC 9.4.0]',
 'TorchVision': '0.15.2+cu118',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.2+46b749b',
 'sys.platform': 'linux'}

Reproducing the problem - code/configuration sample

configs/eval.py

from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.ceval.ceval_gen import ceval_datasets as datasets
    # choose a model of interest
    from .models.hf_chatglm2_6b import models
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

configs/hf_chatglm2_6b.py

from opencompass.models import HuggingFace

models = [
    dict(
        type=HuggingFace,
        abbr='chatglm2-6b-hf',
        path='/mnt/home/00054055/project/models/chatGLM2-6b',
        tokenizer_path='/mnt/home/00054055/project/models/chatGLM2-6b',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(trust_remote_code=True, device_map='auto', revision='a6d54fac46dff2db65d53416c207a4485ca6bd40'),
        # one GPU and one process per inference task
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

Reproducing the problem - command or script

export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1
python run.py configs/eval_chatglm2_6b.py -w /mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/

Reproducing the problem - error message

command output:

  0%|          | 0/16 [00:09<?, ?it/s]launch OpenICLInfer[chatglm2-6b-hf/ceval-college_economics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-accountant,chatglm2-6b-hf/ceval-tax_accountant] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-physician,chatglm2-6b-hf/ceval-civil_servant] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-urban_and_rural_planner,chatglm2-6b-hf/ceval-teacher_qualification] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-college_programming,chatglm2-6b-hf/ceval-electrical_engineer] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-business_administration,chatglm2-6b-hf/ceval-art_studies,chatglm2-6b-hf/ceval-fire_engineer] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer,chatglm2-6b-hf/ceval-education_science,chatglm2-6b-hf/ceval-professional_tour_guide] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-college_chemistry,chatglm2-6b-hf/ceval-metrology_engineer,chatglm2-6b-hf/ceval-mao_zedong_thought,chatglm2-6b-hf/ceval-law] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-veterinary_medicine,chatglm2-6b-hf/ceval-modern_chinese_history,chatglm2-6b-hf/ceval-chinese_language_and_literature,chatglm2-6b-hf/ceval-legal_professional] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-computer_architecture,chatglm2-6b-hf/ceval-middle_school_biology,chatglm2-6b-hf/ceval-middle_school_politics,chatglm2-6b-hf/ceval-middle_school_chemistry] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-high_school_history,chatglm2-6b-hf/ceval-computer_network,chatglm2-6b-hf/ceval-operating_system,chatglm2-6b-hf/ceval-college_physics,chatglm2-6b-hf/ceval-advanced_mathematics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-high_school_physics,chatglm2-6b-hf/ceval-high_school_chemistry,chatglm2-6b-hf/ceval-high_school_biology,chatglm2-6b-hf/ceval-middle_school_mathematics,chatglm2-6b-hf/ceval-middle_school_physics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-marxism,chatglm2-6b-hf/ceval-high_school_politics,chatglm2-6b-hf/ceval-high_school_geography,chatglm2-6b-hf/ceval-ideological_and_moral_cultivation,chatglm2-6b-hf/ceval-high_school_chinese] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-sports_science,chatglm2-6b-hf/ceval-basic_medicine,chatglm2-6b-hf/ceval-probability_and_statistics,chatglm2-6b-hf/ceval-high_school_mathematics,chatglm2-6b-hf/ceval-discrete_mathematics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-middle_school_geography] on GPU 0
08/23 10:33:51 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-logic.out

  6%|▋         | 1/16 [03:02<45:35, 182.34s/it]08/23 10:33:52 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer,chatglm2-6b-hf/ceval-education_science,chatglm2-6b-hf/ceval-professional_tour_guide] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer.out

 12%|█▎        | 2/16 [03:03<17:38, 75.58s/it] 08/23 10:33:52 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-veterinary_medicine,chatglm2-6b-hf/ceval-modern_chinese_history,chatglm2-6b-hf/ceval-chinese_language_and_literature,chatglm2-6b-hf/ceval-legal_professional] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-veterinary_medicine.out

 19%|█▉        | 3/16 [03:03<08:55, 41.18s/it]08/23 10:33:53 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-college_chemistry,chatglm2-6b-hf/ceval-metrology_engineer,chatglm2-6b-hf/ceval-mao_zedong_thought,chatglm2-6b-hf/ceval-law] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-college_chemistry.out
08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-marxism,chatglm2-6b-hf/ceval-high_school_politics,chatglm2-6b-hf/ceval-high_school_geography,chatglm2-6b-hf/ceval-ideological_and_moral_cultivation,chatglm2-6b-hf/ceval-high_school_chinese] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-marxism.out

 31%|███▏      | 5/16 [03:05<03:21, 18.36s/it]08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-physician,chatglm2-6b-hf/ceval-civil_servant] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-physician.out
08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-computer_architecture,chatglm2-6b-hf/ceval-middle_school_biology,chatglm2-6b-hf/ceval-middle_school_politics,chatglm2-6b-hf/ceval-middle_school_chemistry] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-computer_architecture.out

 44%|████▍     | 7/16 [03:05<01:31, 10.18s/it]08/23 10:33:55 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-college_programming,chatglm2-6b-hf/ceval-electrical_engineer] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-college_programming.out

ceval-logic.out (the other .out files look similar):

08/23 10:33:01 - OpenCompass - INFO - Task [chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine]
/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/utils/manager.py:113: UserWarning: <class 'mmengine.logging.logger.MMLogger'> instance named of OpenCompass has been created, the method `get_instance` should not accept any other arguments
  warnings.warn(

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/7 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py", line 147, in <module>
    inferencer.run()
  File "/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py", line 60, in run
    self.model = build_model_from_cfg(model_cfg)
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/build.py", line 22, in build_model_from_cfg
    return MODELS.build(model_cfg)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/mnt/home/00054055/project/code/opencompass/opencompass/models/huggingface.py", line 78, in __init__
    self._load_model(path=path,
  File "/mnt/home/00054055/project/code/opencompass/opencompass/models/huggingface.py", line 110, in _load_model
    self.model = AutoModel.from_pretrained(path, **model_kwargs)
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/fileio.py", line 162, in auto_pt
    res = ori_auto_pt.__func__(cls, pretrained_model_name_or_path,
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/fileio.py", line 138, in model_pt
    res = ori_model_pt.__func__(cls, pretrained_model_name_or_path,
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3260, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 717, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 298, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.90 GiB total capacity; 508.01 MiB already allocated; 46.81 MiB free; 510.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1921474) of binary: /mnt/home/00054055/project/code/opencompass/.venv/bin/python
Traceback (most recent call last):
  File "/mnt/home/00054055/project/code/opencompass/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py FAILED
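
The numbers in the OOM message above already hint at the cause: this process had allocated only about 508 MiB, yet under 47 MiB of the 15.90 GiB device was free. A minimal sketch of that arithmetic follows. The regular expression is written against the message format shown in this log only, and the helper names (`to_mib`, `memory_not_held_by_this_process`) are hypothetical. The remainder it reports is an estimate that also includes CUDA context overhead, so treat it as a rough indicator that other processes were occupying the GPU.

```python
import re

# Matches the fields of the torch.cuda.OutOfMemoryError text shown in the log above.
# Assumption: the message layout is exactly the one printed in this report.
OOM_PATTERN = re.compile(
    r"GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) (?P<total_unit>[MG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[MG]iB) free"
)


def to_mib(value: str, unit: str) -> float:
    """Convert a '15.90'/'GiB' style pair to MiB."""
    return float(value) * (1024.0 if unit == "GiB" else 1.0)


def memory_not_held_by_this_process(message: str) -> dict:
    """Estimate how much device memory is neither free nor allocated by this process."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like the OOM text in this log")
    total = to_mib(match["total"], match["total_unit"])
    allocated = to_mib(match["allocated"], match["allocated_unit"])
    free = to_mib(match["free"], match["free_unit"])
    return {
        "gpu": int(match["gpu"]),
        "total_mib": total,
        "allocated_mib": allocated,
        "free_mib": free,
        "held_elsewhere_mib": total - allocated - free,
    }


if __name__ == "__main__":
    # Message text copied from the traceback above.
    oom_message = (
        "CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.90 GiB total "
        "capacity; 508.01 MiB already allocated; 46.81 MiB free; 510.00 MiB "
        "reserved in total by PyTorch)"
    )
    report = memory_not_held_by_this_process(oom_message)
    print(f"GPU {report['gpu']}: ~{report['held_elsewhere_mib'] / 1024:.1f} GiB held elsewhere")
```

A large "held elsewhere" figure next to a small "already allocated" one suggests the failing task was competing with other processes on the same device rather than being too large by itself.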

[image]

Other information

  1. I tried tag 0.1.2 and the latest commit 0d574c0, with the same result: every model hits OOM on every dataset, even opt350m. That should not happen on my machine, which has four NVIDIA P100 GPUs with 16 GB of memory each.
  2. Checking out commit 3715be6 works for me. I tried hf_chatglm2_6b and xverse-13b, and both work well with the same cfg.
  3. I haven't tried any other commits.
  4. With the default cfg for hf_chatglm2_6b (num_gpus=1), the tasks appear to use only my first GPU, as shown by the command output attached above and by the nvidia-smi output. This differs from commit 3715be6. It should use all GPUs.
  5. I tried changing num_gpus to 4, but it still hits OOM.
  6. I didn't change the cfg except for the model path.

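
The observation that every task landed on the first GPU can be checked directly from the command output. The sketch below counts `launch OpenICLInfer[...] on GPU N` lines per GPU index. The pattern is based only on the lines shown in this report. Support for a comma-separated GPU list (for multi-GPU tasks) is an assumption, and `launches_per_gpu` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern for the launcher lines seen in the command output above.
# Assumption: multi-GPU tasks would print a comma-separated list such as "0,1".
LAUNCH_PATTERN = re.compile(
    r"launch OpenICLInfer\[(?P<tasks>[^\]]+)\] on GPU (?P<gpus>\d+(?:,\d+)*)"
)


def launches_per_gpu(log_text: str) -> Counter:
    """Count how many tasks were launched on each GPU index."""
    counts = Counter()
    for match in LAUNCH_PATTERN.finditer(log_text):
        for gpu in match["gpus"].split(","):
            counts[int(gpu)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the command output in this report.
    sample = (
        "launch OpenICLInfer[chatglm2-6b-hf/ceval-college_economics] on GPU 0\n"
        "launch OpenICLInfer[chatglm2-6b-hf/ceval-accountant,"
        "chatglm2-6b-hf/ceval-tax_accountant] on GPU 0\n"
        "launch OpenICLInfer[chatglm2-6b-hf/ceval-middle_school_geography] on GPU 0\n"
    )
    for gpu, count in sorted(launches_per_gpu(sample).items()):
        print(f"GPU {gpu}: {count} task(s) launched")
```

Run against the full command output, a count concentrated on a single index while other GPUs show none would match what nvidia-smi reported here.
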
gaotongxiao commented 1 year ago

I guess it was just fixed at the latest commit. Would you mind pulling the latest change and trying again?

simonjoe246 commented 1 year ago

> I guess it was just fixed at the latest commit. Would you mind pulling the latest change and trying again?

Hmm... amazing. I pulled the latest commit, and it works.

After checking the commit history between the non-working 0d574c0 and the latest ff5ab92, the only difference in actual code is the following line in run.py: [image]

But I am still confused about why this one line leads to the result I saw. Can you explain? Thanks.

gaotongxiao commented 1 year ago

This option makes it possible to run several tasks on the same GPU, in case users have a powerful GPU that cannot be fully utilized by a single task. For example, we found that a 7B model does not saturate A100-80GB and the GPU utilization rate is constantly below 50%. Setting it to 32 by default was a mistake.
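
To illustrate why a per-GPU limit of 32 could exhaust a 16 GB card, here is a simplified model, not OpenCompass's actual scheduler. It assumes a greedy policy that fills the lowest-index GPU first (chosen only because it matches the "on GPU 0" lines in the report) and a footprint of about 12 GiB per task (an assumption: roughly 6B parameters at 16 bits each). The 16 tasks, 4 GPUs, and the value 32 come from the thread. `assign_tasks` and `fits_in_memory` are hypothetical names.

```python
def assign_tasks(num_tasks: int, num_gpus: int, max_workers_per_gpu: int) -> list:
    """Return how many single-GPU tasks start concurrently on each GPU.

    Assumed policy: each task takes a free slot on the lowest-index GPU that
    still has one. Tasks that find no free slot wait and are not counted.
    """
    running = [0] * num_gpus
    for _ in range(num_tasks):
        for gpu in range(num_gpus):
            if running[gpu] < max_workers_per_gpu:
                running[gpu] += 1
                break
        else:
            break  # every slot is taken, so the remaining tasks queue up
    return running


def fits_in_memory(running: list, per_task_gib: float, gpu_gib: float) -> list:
    """For each GPU, check whether its concurrent tasks fit in device memory."""
    return [count * per_task_gib <= gpu_gib for count in running]


if __name__ == "__main__":
    PER_TASK_GIB = 12.0  # assumption: ~6B parameters in 16-bit precision
    GPU_GIB = 16.0       # P100 16 GB, from the environment section

    for limit in (32, 1):
        running = assign_tasks(num_tasks=16, num_gpus=4, max_workers_per_gpu=limit)
        ok = fits_in_memory(running, PER_TASK_GIB, GPU_GIB)
        print(f"limit={limit}: tasks per GPU={running}, fits={ok}")
```

Under these assumptions a limit of 32 places all 16 tasks on GPU 0, far beyond 16 GiB, while a limit of 1 runs one task per GPU and leaves the rest queued. That is consistent with the behavior described in the thread, though the real scheduling logic may differ.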