mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #40

Closed · chnl closed this issue 11 months ago

chnl commented 11 months ago

```
(streaming) F:\StreamingLLM\streaming-llm>python examples/run_streaming_llama.py --enable_streaming
Loading model from lmsys/vicuna-13b-v1.3 ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|█████████████████████████████████| 3/3 [00:08<00:00, 2.85s/it]
Loading data from data/mt_bench.jsonl ...
StartRecentKVCache: 4, 2000
```

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT:

```
Traceback (most recent call last):
  File "examples/run_streaming_llama.py", line 122, in <module>
    main(args)
  File "examples/run_streaming_llama.py", line 103, in main
    streaming_inference(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 73, in streaming_inference
    past_key_values = greedy_generate(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 20, in greedy_generate
    outputs = model(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "f:\streamingllm\streaming-llm\streaming_llm\pos_shift\modify_llama.py", line 71, in llama_pos_shift_attention_forward
    query_states = self.q_proj(hidden_states)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
```


```
(streaming) F:\StreamingLLM\streaming-llm> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```


```
System Manufacturer: ASUS
System Model:        System Product Name
System Type:         x64-based PC
Processor(s):        1 Processor(s) Installed.
                     [01]: AMD64 Family 23 Model 49 Stepping 0 AuthenticAMD ~3701 Mhz
BIOS Version:        American Megatrends Inc. 1701, 12/13/2022
Windows Directory:   C:\Windows
System Directory:    C:\Windows\system32
```

Guangxuan-Xiao commented 11 months ago

The error you're encountering stems from running half-precision ('Half') computations on the CPU: PyTorch's CPU backend does not implement the `addmm` kernel (used by `F.linear`) for fp16 tensors, so the forward pass fails at the first linear layer.
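
For reference, the same failure can be reproduced with plain PyTorch. This is a minimal, version-dependent sketch (it is not the repo's code, and newer CPU builds may support fp16 here):

```python
import torch

# fp16 weights and inputs on the CPU
lin = torch.nn.Linear(8, 2).half()
x = torch.randn(4, 8, dtype=torch.float16)

# On affected PyTorch builds this raises:
# RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
y = lin(x)
```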

To resolve this issue:

  1. Use a GPU: the demo script is written for GPU execution. If you have a compatible GPU, make sure the model is actually being placed on it.
  2. Switch to FP32: if CPU is your only option, run inference in full precision (FP32). Computation will be slower, but it avoids this error (see the sketch below).
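
Here is a rough sketch of that fallback using plain transformers/PyTorch APIs rather than the repo's own loader (`model_name` is just the model from the log above; treat this as an illustration, not the demo's exact code):

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "lmsys/vicuna-13b-v1.3"  # model from the log above

if torch.cuda.is_available():
    # GPU path: half precision is supported on CUDA
    device, dtype = "cuda", torch.float16
else:
    # CPU fallback: addmm is not implemented for Half on CPU,
    # so drop back to full precision
    device, dtype = "cpu", torch.float32

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
model = model.to(device).eval()
```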

Hope this helps!

Guangxuan

chnl commented 11 months ago

Appreciated. I have an RTX 4090 installed with CUDA 11.8, so I'm not sure why the setup script doesn't recognize the GPU and default to using it. I'll retrace my setup procedure.
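
A quick, generic sanity check for this (not specific to this repo): on Windows it's easy to end up with a CPU-only PyTorch wheel, in which case the model silently stays on the CPU.

```python
import torch

print(torch.cuda.is_available())  # False means the wheel is CPU-only;
                                  # reinstall torch built against CUDA 11.8
print(torch.version.cuda)         # CUDA version of the wheel (None on CPU-only builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RTX 4090
```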