thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License
269 stars · 21 forks

OutOfResources: out of resource: shared memory, Required: 151680, Hardware limit: 101376. #7

Closed · Minami-su closed this 5 months ago

Minami-su commented 5 months ago

Can this be solved by adjusting the configuration parameters? If so, which one? I'm loading the model with load_in_4bit=True (a rough loading sketch follows the config below). Here is my config (config.json):

model:
  type: inf-llm
  path: IA-14B-Chat2
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 8192
conv_type: mistral-inst
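
For reference, a minimal sketch of the 4-bit loading mentioned above (assumed, not the exact loading code used here; only the model path is taken from the config):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed loading path: 4-bit quantization via bitsandbytes,
# equivalent to passing load_in_4bit=True.
model_path = "IA-14B-Chat2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)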
Traceback (most recent call last):
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 125, in <module>
    chat(config)
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 120, in chat
    conv.append(t)
  File "/home/luhao/InfLLM-main/inf_llm/chat.py", line 71, in append
    gen_text = self.searcher.generate(input_ids = new_tokens, max_length=self.max_gen, chunk_size=self.chunk_size, output=True,extra_end_token_ids=[self.tokenizer.bos_token_id,self.tokenizer.pad_token_id,self.tokenizer.eos_token_id], top_k=20, top_p=0.9, temperature=0.95, do_sample=True, repetition_penalty=1.05)[0]
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 33, in generate
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 55, in _decode
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 97, in model_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 16, in hf_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/inf_llm.py", line 54, in forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 558, in append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 520, in _append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 333, in calc_result_and_score
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 364, in mq_attn_triton
    return _attention.apply(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 312, in forward
    o, m, l = _forward(q1, k1, v1, mask1, sm_scale, sliding_window=sliding_window1)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/mq_attn_triton.py", line 246, in _forward
    _attn_fwd[grid](
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 232, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 65, in _attn_fwd
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/compiler/compiler.py", line 579, in __getattribute__
    self._init_handles()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/triton/compiler/compiler.py", line 568, in _init_handles
    raise OutOfResources(self.shared, max_shared, "shared memory")
triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 151680, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
guyan364 commented 5 months ago

Try setting fattn: false
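
In the config posted above, only this key needs to change (other keys stay as posted):

model:
  type: inf-llm
  path: IA-14B-Chat2
  # ... other keys unchanged from the original config ...
  fattn: false   # disable the Triton flash-attention kernel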

We currently use Triton for our custom flash-attention implementation, and it consumes more shared memory than the original CUDA version. We plan to refactor this in the future.
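
If you do need the Triton path, the error's own hint applies: a Triton kernel's shared-memory footprint grows with its tile sizes and num_stages, so shrinking those in the autotune configs can bring it under the hardware limit (101376 bytes is 99 KiB). The snippet below is only a schematic of that pattern, with placeholder names and values, not the actual _attn_fwd kernel in mq_attn_triton.py:

import triton
import triton.language as tl

# Schematic only: smaller BLOCK sizes and fewer pipeline stages (num_stages)
# reduce the shared memory each kernel instance needs.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_stages=3, num_warps=8),
        # fallback config that needs less shared memory:
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 32}, num_stages=2, num_warps=4),
    ],
    key=["N_CTX"],
)
@triton.jit
def _attn_fwd_sketch(Q, K, V, Out, N_CTX,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # kernel body elided; only the tuning surface matters for this illustration
    pass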

Also, we have not yet tested the compatibility of our implementation with quantized models. The PyTorch version may have better compatibility.

If you run into any other issues, please feel free to ask. Thank you for your support!

Minami-su commented 5 months ago

> Try setting fattn: false

Ok, resolved! [screenshot]