intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Chatglm3-6B KV cache demo can't run on Arc A750 #9895

Open KiwiHana opened 10 months ago

KiwiHana commented 10 months ago

Hi, OS: Windows 10, Arc A750, Driver: 5081. Memory usage for chatglm3 and Baichuan2-7B keeps growing as the number of chat rounds increases. Using this KV cache demo does not solve it either: demo link: https://github.com/intel-analytics/BigDL/blob/main/python/llm/portable-zip/chat.py#L201
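For reference, the trick the linked chat.py demo relies on (the `StartRecentKVCache: 4, 512` line in the log below) is, as I understand it, to keep only the first few "sink" tokens plus a recent window of the past key/values so the cache stops growing with the conversation. A minimal sketch of that idea — the helper name and the `(batch, heads, seq, head_dim)` layout are my own simplification, not the demo's exact code:

```python
# Simplified illustration of a start+recent KV-cache trim (my own sketch, not the
# demo's exact implementation). Keeps the first `start_size` tokens and the last
# `recent_size` tokens of each cached key/value tensor, assuming the sequence
# dimension is dim 2 of a (batch, heads, seq, head_dim) tensor.
import torch

def trim_kv_cache(past_key_values, start_size=4, recent_size=512, seq_dim=2):
    trimmed = []
    for k, v in past_key_values:
        seq_len = k.size(seq_dim)
        if seq_len <= start_size + recent_size:
            # Cache still fits inside the window; nothing to drop.
            trimmed.append((k, v))
            continue

        def keep(t):
            # Concatenate the "sink" tokens at the front with the most recent window.
            return torch.cat(
                [t.narrow(seq_dim, 0, start_size),
                 t.narrow(seq_dim, seq_len - recent_size, recent_size)],
                dim=seq_dim,
            )

        trimmed.append((keep(k), keep(v)))
    return tuple(trimmed)
```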

bigdl-core-xe-21              2.5.0b20240111
bigdl-llm                     2.5.0b20240111
intel-extension-for-pytorch   2.1.10+git8ff85d6
torch                         2.1.0a0+cxx11.abi
torchvision                   0.16.0a0+cxx11.abi

python chat_chatglm3_kv.py --model-path="./models/chatglm3-6b-int4"
C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-01-12 18:06:39,767 - INFO - intel_extension_for_pytorch auto imported
C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\model.py:388: FutureWarning: replace_embedding is deprecated and will be removed in a future version, please use cpu_embedding instead.
  warnings.warn("replace_embedding is deprecated and will be removed in a future version,"
2024-01-12 18:06:39,936 - INFO - Converting the current model to sym_int4 format......
2024-01-12 18:06:47,620 - INFO - Converting the current model to sym_int4 format......
StartRecentKVCache: 4, 512

Human: hi
C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py:233: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
 (Triggered internally at C:/Users/Intel/Documents/ipex-21/scripts/intel-extension-for-pytorch/csrc/gpu/jit/fusion_pass.cpp:837.)
  query_layer = apply_rotary_pos_emb_chatglm(query_layer, rotary_pos_emb)
BigDL-LLM: Traceback (most recent call last):
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_chatglm3_kv.py", line 322, in <module>
    stream_chat(model=model,
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_chatglm3_kv.py", line 137, in stream_chat
    past_key_values = greedy_generate(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_chatglm3_kv.py", line 77, in greedy_generate
    outputs = model(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 153, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 233, in chatglm2_attention_forward_8eb45c
    query_layer = apply_rotary_pos_emb_chatglm(query_layer, rotary_pos_emb)
NotImplementedError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Could not run 'torch_ipex::mul_add' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torch_ipex::mul_add' is only available for these backends: [XPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

XPU: registered at C:/Users/Intel/Documents/ipex-21/scripts/intel-extension-for-pytorch/csrc/gpu/aten/operators/TripleOps.cpp:521 [kernel]
BackendSelect: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\functorch\DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\FunctionalizeFallbackKernel.cpp:290 [backend fallback]
Named: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\native\NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\torch\csrc\autograd\TraceTypeManual.cpp:296 [backend fallback]
AutocastCPU: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\autocast_mode.cpp:382 [backend fallback]
AutocastXPU: registered at C:/Users/Intel/Documents/ipex-21/scripts/intel-extension-for-pytorch/csrc/gpu/aten/operators/TripleOps.cpp:521 [kernel]
AutocastCUDA: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\autocast_mode.cpp:249 [backend fallback]
FuncTorchBatched: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\functorch\LegacyBatchingRegistrations.cpp:710 [backend fallback]
FuncTorchVmapMode: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\functorch\VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\functorch\TensorWrapper.cpp:203 [backend fallback]
PythonTLSSnapshot: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\functorch\DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at C:\Users\arc\ruijie\2.1_RC3\python310\frameworks.ai.pytorch.private-gpu\aten\src\ATen\core\PythonFallbackKernel.cpp:157 [backend fallback]
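
For what it's worth, the dispatcher error above (`torch_ipex::mul_add` only registered for the XPU backend) usually means that at least one tensor reaching the fused op is still on the CPU instead of the XPU device. A small, hypothetical sanity check (not ipex-llm code) that fails with a readable message instead of the TorchScript dispatcher error:

```python
# Hypothetical sanity check (not from the repo): make sure every tensor that
# reaches an XPU-only fused op actually lives on the XPU device.
import intel_extension_for_pytorch  # registers the "xpu" device; assumed installed

def assert_on_xpu(**tensors):
    """Raise early with a readable message instead of a backend-dispatch error."""
    for name, t in tensors.items():
        if t.device.type != "xpu":
            raise RuntimeError(f"{name} is on {t.device}, expected xpu")

# e.g. before calling apply_rotary_pos_emb_chatglm(query_layer, rotary_pos_emb):
# assert_on_xpu(query_layer=query_layer, rotary_pos_emb=rotary_pos_emb)
```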
KiwiHana commented 10 months ago

Human: 桌子有左中右3个抽屉;张三,李四,王五,赵六都看到桌子上有一袋巧克力。张三让李四和王五出门后,在赵六面前把这袋巧克力放进了右抽屉;王五回来后,张三让赵六出门去找李四,并在王五面前从左抽屉拿出一盒饼干放进中抽屉里;等李四和赵六返回,张三又让王五和赵六出去买酱油,等二人走后,他告诉李四刚才已将一盒饼干放进中抽屉;张三等了很久,发现王五和赵六还没回来,就派李四去寻找,可最后只有王五和李四回来了。王五告诉张三,一开始他们没有找到卖酱油的店,所以只好分头去买,后来赵六走丢了;回来的路上,王五碰上了李四,两人便先赶了回来。于是,张三让两人出门去找赵六;为防再次走丢,张三叮嘱李四和王五要时刻同行,就算酱油买不到,也要找回赵六。结果,李四和王五在外面找到了赵六,发现他已经买了酱油。三人觉得张三从来不出门跑腿,十分气愤,讨论并达成共识,回去见到张三后,不要告诉他买到了酱油的事情,并让王五把酱油藏到自己的背包里。等三人一同回来后,他们按照计划谎称没有买到酱油,并希望张三以后买东西也要一同出门,不能偷懒,张三答应了。当大家最后站在桌子前,四人分别写下自己知道的物品清单和物品所在位置。问,这四人写下的物品和位置信息是否一致,为什么?

c:\program files\aigc assistant\resources\audiollm\chat_chatglm3_kv.py(155)chatglm3_stream_chat()
-> if user_input == "stop":
(Pdb) c
BigDL-LLM: 这是一个有趣的逻辑谜题。我们可以按照描述的步骤来逐步进行分析:

  1. 张三、李四、王五都看到桌子上有一袋巧克力。
  2. 张三让李四和王五出门后,在赵六面前把这袋巧克力放进了右抽屉。
  3. 王五回来后,张三让赵六出门去找李四,并在王五面前从左抽屉拿出一盒饼干放进中抽屉里。
  4. 李四和赵六返回,张三又让王五和赵六出去买酱油,等二人走后,他告诉李四刚才已将一盒饼干放进中抽屉。
  5. 张三等了很久,发现王五和赵六还没回来,就派李四去寻找,可最后只有王五和李四回来了。
  6. 王五告诉张三,一开始他们没有找到卖酱油的店,所以只好分头去买,后来赵六走丢了。
  7. 回来的路上,王五碰上了李四,两人便先赶了回来。
  8. 张三让两人出门去找赵六;为防再次走丢,张三叮嘱李四和王五要时刻同行,就算酱油买不到,也要找回赵六。
  9. 结果,李四和王五在外面找到了赵六,发现他已经买了酱油。

根据这些步骤,我们可以发现:

问题在于,他们三人写的物品清单和位置信息是否一致。

答案是不一致的。因为:

所以,他们三人写的物品清单和位置信息不一致,因为他们在返回时将饼干放回的抽屉不同。

Human: 桌子有左中右3个抽屉;张三,李四,王五,赵六都看到桌子上有一袋巧克力。张三让李四和王五出门后,在赵六面前把这袋巧克力放进了右抽屉;王五回来后,张三让赵六出门去找李四,并在王五面前从左抽屉拿出一盒饼干放进中抽屉里;等李四和赵六返回,张三又让王五和赵六出去买酱油,等二人走后,他告诉李四刚才已将一盒饼干放进中抽屉;张三等了很久,发现王五和赵六还没回来,就派李四去寻找,可最后只有王五和李四回来了。王五告诉张三,一开始他们没有找到卖酱油的店,所以只好分头去买,后来赵六走丢了;回来的路上,王五碰上了李四,两人便先赶了回来。于是,张三让两人出门去找赵六;为防再次走丢,张三叮嘱李四和王五要时刻同行,就算酱油买不到,也要找回赵六。结果,李四和王五在外面找到了赵六,发现他已经买了酱油。三人觉得张三从来不出门跑腿,十分气愤,讨论并达成共识,回去见到张三后,不要告诉他买到了酱油的事情,并让王五把酱油藏到自己的背包里。等三人一同回来后,他们按照计划谎称没有买到酱油,并希望张三以后买东西也要一同出门,不能偷懒,张三答应了。当大家最后站在桌子前,四人分别写下自己知道的物品清单和物品所在位置。问,这四人写下的物品和位置信息是否一致,为什么?

c:\program files\aigc assistant\resources\audiollm\chat_chatglm3_kv.py(155)chatglm3_stream_chat()
-> if user_input == "stop":
(Pdb) c
BigDL-LLM: Traceback (most recent call last):
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_chatglm3_kv.py", line 308, in <module>
    chatglm3_stream_chat(model=model, tokenizer=tokenizer)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_chatglm3_kv.py", line 166, in chatglm3_stream_chat
    for response, chat_history, past_key_values in model.stream_chat(tokenizer, prompt,
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1073, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1160, in stream_generate
    outputs = self(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 153, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\A380/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 293, in chatglm2_attention_forward_8eb45c
    key_layer, value_layer = append_kv_cache(cache_k, cache_v, key_layer, value_layer)
  File "C:\ProgramData\miniconda3\envs\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\utils.py", line 55, in append_kv_cache
    new_cache_k = cache_k.as_strided(new_size, cache_k.stride(), storage_offset=0)
RuntimeError: setStorage: sizes [1, 32, 1558, 128], strides [5251072, 164096, 128, 1], storage offset 0, and itemsize 4 requiring a storage size of 21145600 are out of bounds for storage of size 21004288

KiwiHana commented 10 months ago

The issue exists for both Chatglm3-6B and Baichuan2-7B, on Arc and on MTL iGPU, on Windows.

qiyuangong commented 9 months ago

The issue exists for both Chatglm3-6B and Baichuan2-7B, on Arc and on MTL iGPU, on Windows.

Hi @KiwiHana . Thank you for submitting this issue! :)

We have reproduced this issue on multiple platforms. The root cause is that we didn't allocate enough KV cache for multi-round stream_chat, which leads to an out-of-bounds access on the KV cache storage rather than running out of GPU memory. This issue also affects the speculative decoding examples.

PR #10006 will fix this issue.
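
To make the `setStorage ... out of bounds` failure above easier to follow, here is a standalone reproduction of the same error, with the shapes taken from the traceback (this is an illustration of the mechanism, not the actual bigdl-llm code):

```python
# A KV cache whose storage was pre-allocated for 1282 tokens cannot be viewed
# as a 1558-token tensor with as_strided; PyTorch raises the setStorage error.
import torch

num_heads, head_dim = 32, 128
allocated_len = 1282                         # length the cache storage was sized for
cache_k = torch.empty(1, num_heads, allocated_len, head_dim)  # float32, itemsize 4

needed_len = 1558                            # tokens accumulated across stream_chat rounds
new_size = (1, num_heads, needed_len, head_dim)

try:
    # Same pattern as append_kv_cache: reinterpret the existing storage with a
    # larger size. The storage is too small, so the call fails.
    cache_k.as_strided(new_size, cache_k.stride(), storage_offset=0)
except RuntimeError as e:
    print(e)   # setStorage: sizes [1, 32, 1558, 128] ... out of bounds for storage ...
```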

KiwiHana commented 9 months ago

bigdl-llm 20240128: the error below can be worked around by copying libsycl-fallback-bfloat16.spv into your_env\Lib\site-packages\intel_extension_for_pytorch\bin. A sketch of that copy step follows.
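
```python
# Helper sketch for the workaround above: copy libsycl-fallback-bfloat16.spv into
# the intel_extension_for_pytorch bin folder of the target environment.
# The source path is a placeholder -- point it at wherever the .spv file lives.
import shutil
from pathlib import Path
import intel_extension_for_pytorch as ipex  # assumed installed in the target env

spv_name = "libsycl-fallback-bfloat16.spv"
src = Path(r"C:\path\to\libsycl-fallback-bfloat16.spv")   # <-- adjust to your setup
dst_dir = Path(ipex.__file__).parent / "bin"              # ...\intel_extension_for_pytorch\bin

dst = dst_dir / spv_name
if not dst.exists():
    shutil.copy2(src, dst)
    print(f"copied {spv_name} to {dst_dir}")
else:
    print(f"{dst} already present")
```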

C:\Program Files\AIGC Assistant\resources\audiollm>..\llmsd_env\python.exe chat_0205.py
C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-02-05 09:46:23,247 - INFO - intel_extension_for_pytorch auto imported
C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\bigdl\llm\transformers\model.py:401: FutureWarning: replace_embedding is deprecated and will be removed in a future version, please use cpu_embedding instead.
  warnings.warn("replace_embedding is deprecated and will be removed in a future version,"
2024-02-05 09:46:23,435 - INFO - Converting the current model to sym_int4 format......

Human: hi
BigDL-LLM: Traceback (most recent call last):
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_0205.py", line 298, in <module>
    chatglm3_stream_chat(model=model, tokenizer=tokenizer)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_0205.py", line 166, in chatglm3_stream_chat
    for response, chat_history, past_key_values in model.stream_chat(tokenizer, prompt,
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1072, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1159, in stream_generate
    outputs = self(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 172, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 196, in chatglm2_attention_forward
    return forward_function(
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 295, in chatglm2_quantized_attention_forward_8eb45c
    attn = linear_q4_0.query_key_fp8_matmul(query_layer, key) / math.sqrt(head_dim)
RuntimeError: Failed to load libsycl-fallback-bfloat16.spv -30 (PI_ERROR_INVALID_VALUE)

KiwiHana commented 9 months ago

I used the bigdl 0128 build with ipex 2.1 and unmodified code (max past token = 512) from https://github.com/intel-analytics/BigDL/pull/10007/files, on an MTL machine with 16 GB of memory. Each input is about 400 tokens and each output is roughly 300 tokens. On the 12th round, chatglm3-6B fails with Native API returns: -999 (Unknown PI error). From the first round through the 12th, iGPU memory usage stays between 5.1 and 5.4 GB, so the memory is no longer growing without bound; the error occurs midway through generating the 12th reply.
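
Just to put numbers on it, a quick back-of-the-envelope sketch (my own illustration) of how the full history grows per round versus the 512 max-past-token cap, which is consistent with the memory staying flat:

```python
# With ~400 input + ~300 output tokens per round, the untrimmed history would
# pass 8000 tokens by round 12, while the KV cache is capped at 512 past tokens.
tokens_per_round = 400 + 300
max_past_tokens = 512

for round_no in range(1, 13):
    full_history = round_no * tokens_per_round
    kept = min(full_history, max_past_tokens)
    print(f"round {round_no:2d}: full history ~ {full_history:5d} tokens, KV cache kept ~ {kept} tokens")
```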

Below is the 12th conversation round, where the error occurred:

Human: 桌子有左中右3个抽屉;张三,李四,王五,赵六都看到桌子上有一袋巧克力。张三让李四和王五出门后,在赵六面前把这袋巧克力放进了右抽屉;王五回来后,张三让赵六出门去找李四,并在王五面前从左抽屉拿出一盒饼干放进中抽屉里;等李四和赵六返回,张三又让王五和赵六出去买酱油,等二人走后,他告诉李四刚才已将一盒饼干放进中抽屉;张三等了很久,发现王五和赵六还没回来,就派李四去寻找,可最后只有王五和李四回来了。王五告诉张三,一开始他们没有找到卖酱油的店,所以只好分头去买,后来赵六走丢了;回来的路上,王五碰上了李四,两人便先赶了回来。于是,张三让两人出门去找赵六;为防再次走丢,张三叮嘱李四和王五要时刻同行,就算酱油买不到,也要找回赵六。结果,李四和王五在外面找到了赵六,发现他已经买了酱油。三人觉得张三从来不出门跑腿,十分气愤,讨论并达成共识,回去见到张三后,不要告诉他买到了酱油的事情,并让王五把酱油藏到自己的背包里。等三人一同回来后,他们按照计划谎称没有买到酱油,并希望张三以后买东西也要一同出门,不能偷懒,张三答应了。当大家最后站在桌子前,四人分别写下自己知道的物品清单和物品所在位置。问,这四人写下的物品和位置信息是否一致,为什么?
BigDL-LLM: 这是一个经典的逻辑谜题。我们可以通过分析每个人的陈述来找出答案。

首先,我们可以看到四个人的陈述如下:
1. 张三:没有找到卖酱油的店。
2. 李四:没有找到卖酱油的店。
3. 王五:分的的开发票。
4. 赵六:没有找到卖酱油的店,在右抽屉。

让我们来分析一下:
- 张三说“没有找到卖酱油的店”,这意味着他并未找到卖酱油的店,所以他的陈述与实际情况不符。
- 李四说“没有找到卖酱油的店”,这与张三的陈述相符,所以他的陈述是正确的。
- 王五说“分的的开发票”,这与事实相符,所以他的陈述是正确的。
- 赵六说“没有找到卖酱油的店,在右抽屉”,这与事实相符。

由此可知,四个人中只有赵六的陈述与实际情况相符,其他三个人都在说谎。那么,为什么他们的物品和位置Traceback (most recent call last):
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_0205.py", line 298, in <module>
    chatglm3_stream_chat(model=model, tokenizer=tokenizer)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Program Files\AIGC Assistant\resources\audiollm\chat_0205.py", line 166, in chatglm3_stream_chat
    for response, chat_history, past_key_values in model.stream_chat(tokenizer, prompt,
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1072, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\torch\utils\_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "C:\Users\test/.cache\huggingface\modules\transformers_modules\chatglm3-6b-int4\modeling_chatglm.py", line 1170, in stream_generate
    next_token_scores = logits_warper(input_ids, next_token_scores)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\transformers\generation\logits_process.py", line 97, in __call__
    scores = processor(input_ids, scores)
  File "C:\Program Files\AIGC Assistant\resources\llmsd_env\lib\site-packages\transformers\generation\logits_process.py", line 315, in __call__
    indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)