open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0
3.78k stars 405 forks

[Bug] Multi-GPU inference runs out of memory #879

Open wjx-git opened 7 months ago

wjx-git commented 7 months ago

Prerequisites

Problem type

I am evaluating with an officially supported task/model/dataset.

Environment

The environment is set up correctly. With max_seq_len set to 16k, single-GPU inference works fine; set to 32k, it runs out of memory.

Reproduces the problem - code/configuration sample

Reproduces the problem - command or script

python run.py configs/eval_qwen_14b.py --reuse qwen_v1-5_14b

Reproduces the problem - error message

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.99 GiB. GPU 0 has a total capacty of 79.35 GiB of which 21.24 GiB is free. Process 142581 has 58.10 GiB memory in use. Of the allocated memory 24.01 GiB is allocated by PyTorch, and 33.60 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
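The error message itself points at one possible mitigation: tuning the caching allocator via `PYTORCH_CUDA_ALLOC_CONF` to reduce fragmentation (33.60 GiB is reserved but unallocated here). A minimal sketch of trying that before the run; the `max_split_size_mb` value of 512 is an arbitrary illustration, not a recommended setting:

```shell
# Cap the size of cached allocator blocks so large reserved-but-unallocated
# regions can be reused, then relaunch the same evaluation command.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
python run.py configs/eval_qwen_14b.py --reuse qwen_v1-5_14b
```

This only helps when the OOM is caused by fragmentation; it cannot fix a genuine shortage of memory for a 32k context on one GPU.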

Other information

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='qwen-v1.5-14b-hf',
        path="/mnt/data/wujx/models/qwen/qwen-v1.5-14b",
        tokenizer_path="/mnt/data/wujx/models/qwen/qwen-v1.5-14b",
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=False,
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
            use_fast=False,
        ),
        pad_token_id=151643,
        max_out_len=128,
        max_seq_len=32768,
        batch_size=1,
        run_cfg=dict(num_gpus=4, num_procs=1, max_num_workers=1),
    )
]

I am using 4x A100-80G to test qwen-14b; setting the maximum window to 32k causes an out-of-memory error.

Run log:

[2024-02-06 08:22:43,250] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
02/06 08:22:45 - OpenCompass - INFO - Reusing experiements from qwen_v1-5_14b
02/06 08:22:45 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
02/06 08:22:45 - OpenCompass - INFO - Partitioned into 1 tasks.
launch OpenICLInfer[qwen-v1.5-14b-hf/LongBench_vcsum,qwen-v1.5-14b-hf/LongBench_narrativeqa,qwen-v1.5-14b-hf/LongBench_multifieldqa_zh,qwen-v1.5-14b-hf/LongBench_lsht,qwen-v1.5-14b-hf/LongBench_dureader,qwen-v1.5-14b-hf/LongBench_passage_retrieval_zh] on GPU 0,1,2,3

It looks like each GPU is running inference independently.

What I want is for all 4 GPUs to serve a single inference process, rather than 4 processes on 4 GPUs.

How should I configure this?

WillWillWong commented 6 months ago

When I was trying to run inference with the 13b model, only one GPU was actually used. My environment is 2x3090, and I hit the following problem:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 124.81 MiB is free. Including non-PyTorch memory, this process has 23.31 GiB memory in use. Of the allocated memory 23.06 GiB is allocated by PyTorch, and 2.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables).

Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s] Loading checkpoint shards: 33%|███▎ | 1/3 [00:04<00:09, 4.91s/it] Loading checkpoint shards: 67%|██████▋ | 2/3 [00:09<00:04, 4.95s/it] Loading checkpoint shards: 67%|██████▋ | 2/3 [00:11<00:05, 5.97s/it]

In fact, it only used the first card. (screenshot)

bittersweet1999 commented 6 months ago

[qwen-v1.5-14b-hf/LongBench_vcsum,qwen-v1.5-14b-hf/LongBench_narrativeqa,qwen-v1.5-14b-hf/LongBench_multifieldqa_zh,qwen-v1.5-14b-hf/LongBench_lsht,qwen-v1.5-14b-hf/LongBench_dureader,qwen-v1.5-14b-hf/LongBench_passage_retrieval_zh] From this task information, it seems that you used a partitioner to allocate 4 tasks across the 4 GPUs. If you want all 4 GPUs to work on a single task, simply don't use the partitioner. By the way, you can also use vLLM for inference.
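A sketch of what this suggestion could look like as a config, assuming the `VLLM` model wrapper, `NaivePartitioner`, `LocalRunner`, and `OpenICLInferTask` names exported by this version of OpenCompass; the exact kwargs are illustrative and should be checked against the installed release:

```python
# Sketch: run one task across all 4 GPUs via vLLM tensor parallelism,
# instead of fanning out 4 single-GPU tasks.
from opencompass.models import VLLM
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

models = [
    dict(
        type=VLLM,
        abbr='qwen-v1.5-14b-vllm',
        path='/mnt/data/wujx/models/qwen/qwen-v1.5-14b',
        model_kwargs=dict(tensor_parallel_size=4),  # shard weights over 4 GPUs
        max_out_len=128,
        max_seq_len=32768,
        batch_size=1,
        run_cfg=dict(num_gpus=4, num_procs=1),
    )
]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=1,  # a single worker, so tasks are not split per GPU
        task=dict(type=OpenICLInferTask),
    ),
)
```

With tensor parallelism, the 32k KV cache and the weights are spread over the 4 cards, which is what the original question was asking for.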

wang99711123 commented 5 months ago

Hi, I ran into the same problem: OOM at a 32K window length. How did you solve it?

disperaller commented 3 months ago

> [qwen-v1.5-14b-hf/LongBench_vcsum,qwen-v1.5-14b-hf/LongBench_narrativeqa,qwen-v1.5-14b-hf/LongBench_multifieldqa_zh,qwen-v1.5-14b-hf/LongBench_lsht,qwen-v1.5-14b-hf/LongBench_dureader,qwen-v1.5-14b-hf/LongBench_passage_retrieval_zh] For this tasks information, it seems that you use partitioner to allocate 4 tasks on 4 gpus, so if you want to use 4 gpus only for one task, just don't use partitioner will be ok. By the way, you can also use VLLM to do inference

deepseek-v2-lite-chat (16b, 32k) also OOMs. If I switch to vLLM, it gets stuck, and after 10 minutes it throws the following error: (screenshot)