intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

streamlit iGPU - RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) #10778

Open JamieVC opened 2 months ago

JamieVC commented 2 months ago

I followed the installation guide https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html in order to run IPEX-LLM on the iGPU of a Meteor Lake machine running Windows.

--- Steps to set environment ---

```
(llm) PS C:\Users\S54 PR> pip install --pre --upgrade ipex-llm[xpu]
(llm) PS C:\Users\S54 PR> pip install streamlit streamlit_chat
(llm) PS C:\Users\S54 PR> New-Item -Path Env:\BIGDL_LLM_XMX_DISABLED -Value '1'
(llm) PS C:\Users\S54 PR> New-Item -Path Env:\SYCL_CACHE_PERSISTENT -Value '1'
```

--- Good when called by an official IPEX-LLM demo ---

It works well when I run demo_ipexllm.py, which comes from https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html.
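For reference, the demo follows the usual IPEX-LLM quickstart pattern; roughly a sketch like the one below (the model id and prompt are placeholders, not the exact contents of demo_ipexllm.py):

```python
# Quickstart-style inference sketch (placeholders, not the exact demo code).
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Load the model in 4-bit and move the optimized model to the iGPU.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, trust_remote_code=True)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```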


--- Error when called by a Streamlit webapp ---

I made a Streamlit webapp, which can be found in chat_streamlit_20240416.zip. When I run it on Windows, I get the error message below. Do you have any suggestions to resolve the issue?

(llm) PS C:\source\ipex\demo_igpu> streamlit run chat_streamlit.py -- --model_id "meta-llama/Llama-2-7b-chat-hf"

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8505/ Network URL: http://10.174.192.123:8505/

```
C:\miniconda3\envs\llm\Lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\miniconda3\envs\llm\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-04-16 17:39:57,163 - INFO - intel_extension_for_pytorch auto imported
LlamaModel()
Loading models...
Loading models...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.33it/s]
2024-04-16 17:40:00,833 - INFO - Converting the current model to sym_int4 format......
Successfully loaded Tokenizer and optimized Model!
Configuration...
Configuration...
question:
question:
time taken 0.00048689999857742805
time taken 0.0002223999999841908
Loading models...
Configuration...
question: what is human?
time taken 0.0009090999992622528
Loading models...
Configuration...
question: what is human?
Send!
next_answer
Preparing the response
st empty
build_inputs()
prompt: [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

what is human? [/INST]
generate_iterate()
TextIteratorStreamer()
Thread()
t.start()
C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\models\llama.py:238: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`
  warnings.warn(
Exception in thread Thread-17 (generate):
Traceback (most recent call last):
  File "C:\miniconda3\envs\llm\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "C:\miniconda3\envs\llm\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\transformers\generation\utils.py", line 1588, in generate
    return self.sample(
  File "C:\miniconda3\envs\llm\Lib\site-packages\transformers\generation\utils.py", line 2642, in sample
    outputs = self(
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 1980, in llama_model_forward
    layer_outputs = decoder_layer(
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 248, in llama_decoder_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 334, in llama_attention_forward_4_31
    return forward_function(
  File "C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 399, in llama_attention_forward_4_31_quantized
    query_states = self.q_proj(hidden_states)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 685, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
```
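For context, the log above shows generate_iterate() using the common TextIteratorStreamer pattern, where model.generate() runs in a background thread; a simplified sketch of that pattern (illustrative names, not the exact code in the zip) is:

```python
# Simplified sketch of the streaming pattern seen in the log above;
# function and variable names are illustrative, not the exact app code.
from threading import Thread
from transformers import TextIteratorStreamer

def generate_iterate(model, tokenizer, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to("xpu")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)

    # model.generate() runs in a worker thread; the RuntimeError above is
    # raised inside this thread when q_proj executes on the XPU.
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    # The Streamlit app consumes decoded text chunks as they arrive.
    for text in streamer:
        yield text
```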

Thanks, Jamie

Oscilloscope98 commented 2 months ago

Hi @JamieVC,

On our machine (an MTL iGPU with 16GB of memory), we could not reproduce your issue.

However, `Native API returns: -999` usually indicates an out-of-memory (OOM) error. There are several things you could try to reduce iGPU memory usage, especially if your iGPU only has 8GB of memory (a combined sketch follows the list below):

  1. Restart your machine to release idle iGPU memory.
  2. Set cpu_embedding=True in the from_pretrained function (it seems you have already done this).
  3. Set IPEX_LLM_LOW_MEM=1 in your environment.
  4. Use model = model.half().to('xpu') instead of model = model.to('xpu').
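A minimal sketch combining suggestions 2-4 when loading the model (assuming the same meta-llama/Llama-2-7b-chat-hf model id from your command and the 4-bit loading shown in your log):

```python
# Minimal sketch combining the memory-saving suggestions above.
import os

# Suggestion 3: set before ipex_llm is imported so the setting takes effect.
os.environ["IPEX_LLM_LOW_MEM"] = "1"

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Suggestion 2: keep the embedding layer on the CPU while loading in 4-bit.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    cpu_embedding=True,
    trust_remote_code=True,
)

# Suggestion 4: move the optimized model to the iGPU in half precision.
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```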

Please let us know if you run into any further problems :)