[Bug] All models meet OOM in tag 0.1.2, but back to commit 3715be6 works

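Prerequisites

[X] I have searched Issues and Discussions but did not get the expected help.
[X] The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment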

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0,1,2,3': 'Tesla P100-SXM2-16GB',
 'MMEngine': '0.8.4',
 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
 'OpenCV': '4.8.0',
 'PyTorch': '2.0.1+cu118',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '
                              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.8\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.7\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
                              'CUDNN_VERSION=8.7.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '
                              '-DUSE_FBGEMM -DUSE_QNNPACK '
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '
                              '-Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.10 (main, Aug 17 2023, 19:49:57) [GCC 9.4.0]',
 'TorchVision': '0.15.2+cu118',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.2+46b749b',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

configs/eval.py

from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.ceval.ceval_gen import ceval_datasets as datasets
    # choose a model of interest
    from .models.hf_chatglm2_6b import models
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

configs/hf_chatglm2_6b.py

from opencompass.models import HuggingFace

models = [
    dict(
        type=HuggingFace,
        abbr='chatglm2-6b-hf',
        path='/mnt/home/00054055/project/models/chatGLM2-6b',
        tokenizer_path='/mnt/home/00054055/project/models/chatGLM2-6b',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(trust_remote_code=True, device_map='auto', revision='a6d54fac46dff2db65d53416c207a4485ca6bd40'),
        # one GPU and one process per task (the default for this model config)
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

Reproduces the problem - command or script

export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1
python run.py configs/eval_chatglm2_6b.py -w /mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/

Reproduces the problem - error message

command output:

  0%|          | 0/16 [00:09<?, ?it/s]launch OpenICLInfer[chatglm2-6b-hf/ceval-college_economics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-accountant,chatglm2-6b-hf/ceval-tax_accountant] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-physician,chatglm2-6b-hf/ceval-civil_servant] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-urban_and_rural_planner,chatglm2-6b-hf/ceval-teacher_qualification] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-college_programming,chatglm2-6b-hf/ceval-electrical_engineer] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-business_administration,chatglm2-6b-hf/ceval-art_studies,chatglm2-6b-hf/ceval-fire_engineer] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer,chatglm2-6b-hf/ceval-education_science,chatglm2-6b-hf/ceval-professional_tour_guide] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-college_chemistry,chatglm2-6b-hf/ceval-metrology_engineer,chatglm2-6b-hf/ceval-mao_zedong_thought,chatglm2-6b-hf/ceval-law] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-veterinary_medicine,chatglm2-6b-hf/ceval-modern_chinese_history,chatglm2-6b-hf/ceval-chinese_language_and_literature,chatglm2-6b-hf/ceval-legal_professional] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-computer_architecture,chatglm2-6b-hf/ceval-middle_school_biology,chatglm2-6b-hf/ceval-middle_school_politics,chatglm2-6b-hf/ceval-middle_school_chemistry] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-high_school_history,chatglm2-6b-hf/ceval-computer_network,chatglm2-6b-hf/ceval-operating_system,chatglm2-6b-hf/ceval-college_physics,chatglm2-6b-hf/ceval-advanced_mathematics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-high_school_physics,chatglm2-6b-hf/ceval-high_school_chemistry,chatglm2-6b-hf/ceval-high_school_biology,chatglm2-6b-hf/ceval-middle_school_mathematics,chatglm2-6b-hf/ceval-middle_school_physics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-marxism,chatglm2-6b-hf/ceval-high_school_politics,chatglm2-6b-hf/ceval-high_school_geography,chatglm2-6b-hf/ceval-ideological_and_moral_cultivation,chatglm2-6b-hf/ceval-high_school_chinese] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-sports_science,chatglm2-6b-hf/ceval-basic_medicine,chatglm2-6b-hf/ceval-probability_and_statistics,chatglm2-6b-hf/ceval-high_school_mathematics,chatglm2-6b-hf/ceval-discrete_mathematics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-middle_school_geography] on GPU 0
08/23 10:33:51 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-logic.out

  6%|▋         | 1/16 [03:02<45:35, 182.34s/it]08/23 10:33:52 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer,chatglm2-6b-hf/ceval-education_science,chatglm2-6b-hf/ceval-professional_tour_guide] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-environmental_impact_assessment_engineer.out

 12%|█▎        | 2/16 [03:03<17:38, 75.58s/it] 08/23 10:33:52 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-veterinary_medicine,chatglm2-6b-hf/ceval-modern_chinese_history,chatglm2-6b-hf/ceval-chinese_language_and_literature,chatglm2-6b-hf/ceval-legal_professional] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-veterinary_medicine.out

 19%|█▉        | 3/16 [03:03<08:55, 41.18s/it]08/23 10:33:53 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-college_chemistry,chatglm2-6b-hf/ceval-metrology_engineer,chatglm2-6b-hf/ceval-mao_zedong_thought,chatglm2-6b-hf/ceval-law] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-college_chemistry.out
08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-marxism,chatglm2-6b-hf/ceval-high_school_politics,chatglm2-6b-hf/ceval-high_school_geography,chatglm2-6b-hf/ceval-ideological_and_moral_cultivation,chatglm2-6b-hf/ceval-high_school_chinese] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-marxism.out

 31%|███▏      | 5/16 [03:05<03:21, 18.36s/it]08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-physician,chatglm2-6b-hf/ceval-civil_servant] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-physician.out
08/23 10:33:54 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-computer_architecture,chatglm2-6b-hf/ceval-middle_school_biology,chatglm2-6b-hf/ceval-middle_school_politics,chatglm2-6b-hf/ceval-middle_school_chemistry] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-computer_architecture.out

 44%|████▍     | 7/16 [03:05<01:31, 10.18s/it]08/23 10:33:55 - OpenCompass - WARNING - task OpenICLInfer[chatglm2-6b-hf/ceval-college_programming,chatglm2-6b-hf/ceval-electrical_engineer] fail, see
/mnt/home/00054055/project/code/opencompass/outputs/chatglm2_6b_ceval/20230823_103049/logs/infer/chatglm2-6b-hf/ceval-college_programming.out
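
To make the GPU placement in the output above easier to see, here is a minimal sketch that tallies the `launch OpenICLInfer[...] on GPU N` lines per GPU. The helper name `tally_launches` is hypothetical and not part of OpenCompass, and the comma-separated form for multi-GPU launches is an assumption; only the single-GPU form appears in this log.

```python
import re
from collections import Counter

# Matches launcher lines such as:
#   launch OpenICLInfer[model/dataset,...] on GPU 0
# The comma-separated GPU list is an assumption; this log only shows "GPU 0".
LAUNCH_RE = re.compile(r"launch OpenICLInfer\[[^\]]*\] on GPU (\d+(?:,\d+)*)")


def tally_launches(log_text):
    """Count how many tasks were launched on each GPU index."""
    counts = Counter()
    for match in LAUNCH_RE.finditer(log_text):
        for gpu in match.group(1).split(","):
            counts[int(gpu)] += 1
    return counts


# Three lines copied from the command output above.
sample = """\
  0%|          | 0/16 [00:09<?, ?it/s]launch OpenICLInfer[chatglm2-6b-hf/ceval-college_economics] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-accountant,chatglm2-6b-hf/ceval-tax_accountant] on GPU 0
launch OpenICLInfer[chatglm2-6b-hf/ceval-middle_school_geography] on GPU 0
"""

print(tally_launches(sample))
```

Run against the full output, every one of the 16 launch lines lands on GPU 0, while GPUs 1 to 3 receive nothing.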

ceval-logic.out (the other .out files look similar):

08/23 10:33:01 - OpenCompass - INFO - Task [chatglm2-6b-hf/ceval-logic,chatglm2-6b-hf/ceval-middle_school_history,chatglm2-6b-hf/ceval-plant_protection,chatglm2-6b-hf/ceval-clinical_medicine]
/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/utils/manager.py:113: UserWarning: <class 'mmengine.logging.logger.MMLogger'> instance named of OpenCompass has been created, the method `get_instance` should not accept any other arguments
  warnings.warn(

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/7 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py", line 147, in <module>
    inferencer.run()
  File "/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py", line 60, in run
    self.model = build_model_from_cfg(model_cfg)
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/build.py", line 22, in build_model_from_cfg
    return MODELS.build(model_cfg)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/mnt/home/00054055/project/code/opencompass/opencompass/models/huggingface.py", line 78, in __init__
    self._load_model(path=path,
  File "/mnt/home/00054055/project/code/opencompass/opencompass/models/huggingface.py", line 110, in _load_model
    self.model = AutoModel.from_pretrained(path, **model_kwargs)
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/fileio.py", line 162, in auto_pt
    res = ori_auto_pt.__func__(cls, pretrained_model_name_or_path,
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/home/00054055/project/code/opencompass/opencompass/utils/fileio.py", line 138, in model_pt
    res = ori_model_pt.__func__(cls, pretrained_model_name_or_path,
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3260, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 717, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 298, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.90 GiB total capacity; 508.01 MiB already allocated; 46.81 MiB free; 510.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1921474) of binary: /mnt/home/00054055/project/code/opencompass/.venv/bin/python
Traceback (most recent call last):
  File "/mnt/home/00054055/project/code/opencompass/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/home/00054055/project/code/opencompass/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/home/00054055/project/code/opencompass/opencompass/tasks/openicl_infer.py FAILED

Other information

I tried both tag 0.1.2 and the latest commit 0d574c0, with the same result: every model hits OOM on every dataset, even opt350m. That should not happen on my machine, which has four NVIDIA P100 GPUs with 16 GB of memory each.
Checking out commit 3715be6 works for me. I tried hf_chatglm2_6b and xverse-13b, and both run fine with the same config.
I haven't tried any other commits.
With the default config for hf_chatglm2_6b (num_gpus=1), the tasks appear to use only my first GPU, as shown by the command output attached above and by the nvidia-smi output. This differs from commit 3715be6. I expected the tasks to be spread across all GPUs.
I tried changing num_gpus to 4, but still got OOM.
I didn't change the config apart from the model path.
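
The numbers in the OOM message support the single-GPU observation. The following is a rough sketch (the helper `memory_held_elsewhere` is hypothetical, and the regular expression assumes the exact message format shown in the traceback above) that estimates how much of GPU 0 was neither free nor reserved by the failing process. That remainder covers other processes plus this process's own CUDA context overhead, so treat it as an upper bound.

```python
import re

# Assumes the PyTorch 2.0.1 OOM message format quoted in the traceback above.
OOM_RE = re.compile(
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) MiB already allocated; "
    r"(?P<free>[\d.]+) MiB free; "
    r"(?P<reserved>[\d.]+) MiB reserved in total by PyTorch\)"
)


def memory_held_elsewhere(oom_message):
    """Return MiB that is neither free nor reserved by the failing process."""
    match = OOM_RE.search(oom_message)
    if match is None:
        raise ValueError("unrecognised OOM message format")
    total_mib = float(match["total"]) * 1024  # GiB -> MiB
    return total_mib - float(match["free"]) - float(match["reserved"])


message = (
    "CUDA out of memory. Tried to allocate 36.00 MiB "
    "(GPU 0; 15.90 GiB total capacity; 508.01 MiB already allocated; "
    "46.81 MiB free; 510.00 MiB reserved in total by PyTorch)"
)

print(f"{memory_held_elsewhere(message):.2f} MiB held outside this process's reservation")
```

For the message above this comes to roughly 15.4 GiB of the 15.90 GiB card, even though the failing process had reserved only 510 MiB. That is consistent with many inference tasks loading the model onto GPU 0 at the same time.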

open-compass / opencompass

[Bug] All models meet OOM in tag 0.1.2, but back to commit 3715be6 works #249
