thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

Qwen1.5-72B-chat-AWQ with longbench and infinibench benchmark OOM with A100 80G #38

Open ehuaa opened 4 months ago

ehuaa commented 4 months ago

When I test Qwen1.5-72B-chat-AWQ with `bash scripts/longbench.sh`, it runs out of GPU memory (OOM) on an A100 80G.

My config:

```yaml
model:
  type: inf-llm
  path: /root/czh/quant_models/Qwen2-geogpt-72b-0412-awq-dde-12000
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  fattn: false
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 2048
conv_type: qwen
```
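For reference, a 72B model (even AWQ-quantized) plus InfLLM's attention workspace is a tight fit on a single 80G card. The fragment below is only an untested guess at a lower-memory variant of the same config, shrinking the knobs that scale the attention/cache footprint; these are not values verified by the maintainers:

```yaml
# Untested lower-memory guesses for the config above:
n_local: 2048        # halve the local attention window
max_cached_block: 16 # keep fewer context memory blocks resident on GPU
chunk_size: 512      # process the input in smaller chunks
fattn: true          # FlashAttention path, avoiding the torch_impl masked_fill workspace
```

The `fattn: true` suggestion is motivated by the traceback failing inside `dot_production_attention/torch_impl.py` at a `torch.masked_fill` allocation, which the FlashAttention path would not materialize.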

The traceback is as follows:

```
Traceback (most recent call last):
  File "/root/czh/InfLLM/benchmark/pred.py", line 321, in <module>
    preds = get_pred(
  File "/root/czh/InfLLM/benchmark/pred.py", line 256, in get_pred
    output = searcher.generate(
  File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1169, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/czh/InfLLM/inf_llm/attention/inf_llm.py", line 64, in forward
    o = past_key_value.append(
  File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 774, in append
    chunk_o, local_score = self._append(
  File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 526, in _append
    attn.append(
  File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 96, in append
    self.finalize()
  File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 22, in finalize
    tmp = torch.masked_fill(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 190.19 MiB is free. Process 3985934 has 78.95 GiB memory in use. Of the allocated memory 75.61 GiB is allocated by PyTorch, and 2.82 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/local/lib/python3.10/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Evaluating on: ['result.json']
{}
```

Can someone help with this issue? Thanks!
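Fragmentation is probably not the root cause here (75.61 GiB is genuinely allocated by PyTorch), but the allocator hint printed in the error message is a cheap first thing to try. A minimal sketch, assuming the variable is set before PyTorch initializes CUDA:

```python
import os

# Allocator hint suggested by the OOM message itself. It must be in the
# environment before torch first touches CUDA, so set it at the very top
# of benchmark/pred.py (or export it in the shell before running the script).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```

Equivalently, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True bash scripts/longbench.sh` from the shell.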