thunlp / InfLLM

The code for our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

GPU memory usage at benchmark #12

Closed Minami-su closed 5 months ago

Minami-su commented 5 months ago

I want to know the expected GPU memory usage, because I run out of memory when testing the benchmarks.

Model: Qwen1.5-0.5B-Chat
GPU: RTX 3090
Commands: bash scripts/infintebench.sh, bash scripts/longbench.sh
Result: CUDA out of memory.

guyan364 commented 5 months ago

Hi! Could you provide more information, such as your configuration and which dataset you were evaluating when the out of memory issue occurred?

Minami-su commented 5 months ago

> Hi! Could you provide more information, such as your configuration and which dataset you were evaluating when the out of memory issue occurred?

config.json

model:
  type: inf-llm
  path: Qwen1.5-0.5B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 8192
conv_type: qwen
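
As a rough back-of-envelope check of what this config keeps on the GPU, the sketch below estimates the GPU-resident KV cache. It is a minimal sketch, not InfLLM code: the Qwen1.5-0.5B dimensions are assumed values that should be read from the model's config, and treating max_cached_block as GPU-resident context blocks per layer is also an assumption.

# Rough estimate of the GPU-resident KV cache for the settings above.
# Hypothetical helper, not part of InfLLM; the model dimensions are assumed
# values for Qwen1.5-0.5B-Chat and should be taken from its config file.
n_layers = 24          # assumed num_hidden_layers
n_kv_heads = 16        # assumed num_key_value_heads
head_dim = 64          # assumed hidden_size // num_attention_heads
bytes_per_elem = 2     # fp16 / bf16

n_init, n_local = 128, 4096              # from the YAML above
block_size, max_cached_block = 128, 32   # from the YAML above

# initial tokens + local window + cached context blocks kept on the GPU
tokens_on_gpu = n_init + n_local + block_size * max_cached_block

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * tokens_on_gpu * bytes_per_elem
print(f"approx. GPU KV cache: {kv_bytes / 2**20:.1f} MiB")   # well under 1 GiB

If this estimate holds, the 0.5B model weights plus this cache should fit comfortably within 24 GB.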

bash scripts/longbench.sh

mkdir: cannot create directory ‘benchmark/longbench-result’: File exists
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Pred narrativeqa
  0%|                                                                                           | 0/200 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  0%|                                                                                           | 0/200 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/luhao/InfLLM-main2/benchmark/pred.py", line 299, in <module>
    preds = get_pred(
  File "/home/luhao/InfLLM-main2/benchmark/pred.py", line 241, in get_pred
    output = searcher.generate(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 33, in generate
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 55, in _decode
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 98, in model_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 16, in hf_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/inf_llm.py", line 60, in forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 726, in append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 616, in append_global
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 20, in __init__
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/root/anaconda3/envs/train/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
guyan364 commented 5 months ago

I tested narrativeqa on an A40 (48G) with your settings, limiting the CUDA memory usage to 24G, and no out-of-memory issue occurred. Did you use the Qwen/Qwen1.5-0.5B-Chat model from the Hugging Face Hub?
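
For anyone trying to reproduce such a 24G cap on a larger card, one option is PyTorch's per-process memory fraction. This is a minimal sketch and an assumption about how the limit could be applied, not necessarily how it was done above.

import torch

# Cap PyTorch's CUDA caching allocator on device 0 to roughly 24 GiB,
# e.g. to mimic an RTX 3090 on an A40. Run this before the benchmark
# allocates any memory.
total = torch.cuda.get_device_properties(0).total_memory   # bytes on device 0
cap = 24 * 1024**3                                          # 24 GiB target
torch.cuda.set_per_process_memory_fraction(min(1.0, cap / total), device=0)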