thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Qwen1.5-7B-Chat CUDA error: out of memory #21

Closed yinochaos closed 5 months ago

yinochaos commented 5 months ago

Hardware: A800 80G GPU, 360G system RAM. Config file:

model:
  type: inf-llm
  path: Qwen/Qwen1.5-7B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 512
conv_type: qwen

I modified the pred script to run inference. The input is roughly 280K tokens (inputs under 190K tokens run without error). Error message:

Traceback (most recent call last):
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 325, in <module>
    preds = get_pred(
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 271, in get_pred
    output = searcher.generate(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/inf_llm.py", line 58, in forward
    o = past_key_value.append(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 725, in append
    self.append_global(ed - st, kv_ed - kv_st, local_score)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 620, in append_global
    MemoryUnit(self.global_remainder[0][u, :, global_remainder_st:global_remainder_st + self.block_size, :],
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 34, in __init__
    cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
RuntimeError: CUDA error: out of memory

How can I resolve this? The peak GPU memory usage I observed was only 30+ GB, so where is the problem coming from?

guyan364 commented 5 months ago

Hi, this might be a pinned-memory issue; try removing the `pin_memory` call in `MemoryUnit` and see if that helps.
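For reference, a minimal sketch of that change, based on the `MemoryUnit.__init__` line from the traceback above (`inf_llm/attention/context_manager.py`, line 34):

```python
# Before (from the traceback): .pin_memory() asks the CUDA driver for
# page-locked host memory, which can fail with "CUDA error: out of memory"
# even while GPU memory itself still has headroom.
cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()

# After: keep a plain pageable host copy. With a pageable destination,
# non_blocking=True has no effect (the copy is synchronous), so offloading
# gets slower, but the pinned-allocation path is avoided entirely.
cpu_data = data.contiguous().to("cpu", non_blocking=True)
```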

yinochaos commented 5 months ago

> Hi, this might be a pinned-memory issue; try removing the `pin_memory` call in `MemoryUnit` and see if that helps.

Hi, after removing `pin_memory` I still get the same error:

    cpu_data = data.contiguous().to("cpu", non_blocking=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Other relevant environment info: Python 3.10.14, Driver Version: 470.161.03, CUDA Version: 12.1, torch: 2.2.2+cu121, transformers: 4.39.2

guyan364 commented 5 months ago

Sorry, we don't have an identical test environment at the moment and cannot reproduce your issue. You could try a CUDA 11.8 build of torch.
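For example, a CUDA 11.8 build of the same torch version should be installable with `pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118`, assuming a matching wheel exists for your Python version.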

yinochaos commented 5 months ago

> Sorry, we don't have an identical test environment at the moment and cannot reproduce your issue. You could try a CUDA 11.8 build of torch.

OK, thanks.

ehuaa commented 4 months ago

@yinochaos Hi, may I ask how you eventually resolved this issue?

Becomebright commented 5 days ago

The OOM is most likely host (CPU) memory, not GPU memory. There are two possibilities:

  1. You are hitting the kernel's limit on the number of memory-mapped regions; try raising `vm.max_map_count` (see the sketch after this list).
  2. Host RAM is genuinely exhausted, in which case the only option is to offload to disk.
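As a back-of-the-envelope check of possibility 1, here is a sketch that assumes, purely for illustration (the real per-layer accounting in InfLLM may differ), one pinned host allocation per offloaded block per decoder layer, with `block_size: 128` from the config above and 32 layers for Qwen1.5-7B:

```python
# Rough estimate of how many pinned host allocations InfLLM's offloading
# creates, versus the kernel's mmap-region limit. The per-layer accounting
# here is an assumption for illustration, not taken from the InfLLM code.

def estimated_allocations(n_tokens: int, block_size: int = 128,
                          n_layers: int = 32) -> int:
    return (n_tokens // block_size) * n_layers

with open("/proc/sys/vm/max_map_count") as f:
    limit = int(f.read())  # Linux default: 65530

for n_tokens in (190_000, 280_000):
    print(f"{n_tokens:>7} tokens -> ~{estimated_allocations(n_tokens):,} "
          f"allocations (vm.max_map_count = {limit:,})")

# 190K tokens -> ~47,488; 280K tokens -> ~69,984. Only the latter exceeds
# the default limit of 65,530, matching the "fails above 190K" observation.
```

If the estimate crosses the limit, raising it with `sudo sysctl -w vm.max_map_count=262144` (and persisting the setting in `/etc/sysctl.conf`) would be the usual fix.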