Closed · chnl closed this 11 months ago
The error you're encountering stems from the use of half-precision (`Half`) computations on a CPU; PyTorch's CPU kernels typically do not support half-precision matrix multiplication. To resolve this, make sure the model is loaded onto your CUDA device, or cast it to `float32` if you need to run on CPU.
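A minimal sketch of that device/dtype decision (the helper name is my own, not part of the repo's code; the `torch` import is guarded so the sketch runs anywhere):

```python
# Sketch: only use fp16 when a CUDA GPU is actually visible,
# since fp16 matmul (addmm) may have no CPU implementation.
def pick_device_and_dtype(cuda_available):
    # fall back to fp32 on CPU to avoid the Half addmm error
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")

try:
    import torch  # guarded so the sketch runs even without torch installed
    device, dtype = pick_device_and_dtype(torch.cuda.is_available())
    print(f"load the model on {device} as {dtype}")
except ImportError:
    pass
```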
Hope this helps!
Guangxuan
Appreciated. I have an RTX 4090 installed with CUDA 11.8, so I'm not sure why the setup script doesn't recognize it and default to the GPU. I'll retrace my setup procedure.
```
(streaming) F:\StreamingLLM\streaming-llm>python examples/run_streaming_llama.py --enable_streaming
Loading model from lmsys/vicuna-13b-v1.3 ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:08<00:00,  2.85s/it]
Loading data from data/mt_bench.jsonl ...
StartRecentKVCache: 4, 2000

USER: Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

ASSISTANT: Traceback (most recent call last):
  File "examples/run_streaming_llama.py", line 122, in <module>
    main(args)
  File "examples/run_streaming_llama.py", line 103, in main
    streaming_inference(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 73, in streaming_inference
    past_key_values = greedy_generate(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "examples/run_streaming_llama.py", line 20, in greedy_generate
    outputs = model(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\transformers\models\llama\modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "f:\streamingllm\streaming-llm\streaming_llm\pos_shift\modify_llama.py", line 71, in llama_pos_shift_attention_forward
    query_states = self.q_proj(hidden_states)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\matt\.conda\envs\streaming\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
```
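For context, the bottom frame is just an fp16 `Linear` layer executing on CPU; this hypothetical snippet (my own, not from the repo) reproduces the same `RuntimeError` on affected PyTorch builds, with the `torch` import guarded:

```python
def repro_half_on_cpu():
    """Return the error message from running an fp16 Linear on CPU, if any."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    try:
        lin = torch.nn.Linear(4, 4).half()           # fp16 weights on CPU
        lin(torch.randn(2, 4, dtype=torch.float16))  # hits F.linear -> addmm
        return "no error (this build supports fp16 addmm on CPU)"
    except RuntimeError as e:
        return str(e)  # expected: addmm not implemented for 'Half' on CPU

print(repro_half_on_cpu())
```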
```
(streaming) F:\StreamingLLM\streaming-llm>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```
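One thing worth checking while retracing: `nvcc` only confirms the CUDA toolkit is installed; the PyTorch wheel itself can still be a CPU-only build, in which case `torch.cuda.is_available()` returns False. A quick sketch of that check (the helper name is my own, and the `torch` import is guarded):

```python
def is_cpu_only_wheel(version):
    # CPU-only PyTorch wheels carry a "+cpu" local version tag,
    # e.g. "2.1.0+cpu"; CUDA 11.8 wheels look like "2.1.0+cu118"
    return version.endswith("+cpu")

try:
    import torch  # guarded so the sketch runs even without torch installed
    print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
    if is_cpu_only_wheel(torch.__version__):
        print("CPU-only wheel installed; reinstall a CUDA build")
except ImportError:
    pass
```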
```
System Manufacturer:       ASUS
System Model:              System Product Name
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: AMD64 Family 23 Model 49 Stepping 0 AuthenticAMD ~3701 Mhz
BIOS Version:              American Megatrends Inc. 1701, 12/13/2022
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
```