intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) #11400

Open · TriDefender opened this issue 2 months ago

TriDefender commented 2 months ago

The traceback is as follows. I was running ChatGLM4-9b-chat on my laptop. Device configuration: OS: Win 11 23H2 (22631.3737)

(llm_gpu) F:\LLM_Local>pip list

Package                     Version
accelerate                  0.23.0
annotated-types             0.7.0
bigdl-core-xe-21            2.5.0b20240620
bigdl-core-xe-addons-21     2.5.0b20240620
bigdl-core-xe-batch-21      2.5.0b20240620
certifi                     2024.6.2
charset-normalizer          3.3.2
colorama                    0.4.6
dpcpp-cpp-rt                2024.0.2
filelock                    3.15.1
fsspec                      2024.6.0
huggingface-hub             0.23.4
idna                        3.7
intel-cmplr-lib-rt          2024.0.2
intel-cmplr-lic-rt          2024.0.2
intel-extension-for-pytorch 2.1.10+xpu
intel-opencl-rt             2024.0.2
intel-openmp                2024.0.2
ipex-llm                    2.1.0b20240620
Jinja2                      3.1.4
MarkupSafe                  2.1.5
mkl                         2024.0.0
mkl-dpcpp                   2024.0.0
mpmath                      1.3.0
networkx                    3.3
numpy                       1.26.4
onednn                      2024.0.0
onemkl-sycl-blas            2024.0.0
onemkl-sycl-datafitting     2024.0.0
onemkl-sycl-dft             2024.0.0
onemkl-sycl-lapack          2024.0.0
onemkl-sycl-rng             2024.0.0
onemkl-sycl-sparse          2024.0.0
onemkl-sycl-stats           2024.0.0
onemkl-sycl-vm              2024.0.0
packaging                   24.1
pillow                      10.3.0
pip                         24.0
protobuf                    5.27.1
psutil                      5.9.8
py-cpuinfo                  9.0.0
pydantic                    2.7.4
pydantic_core               2.18.4
PyYAML                      6.0.2rc1
regex                       2024.5.15
requests                    2.32.3
safetensors                 0.4.3
sentencepiece               0.2.0
setuptools                  69.5.1
sympy                       1.13.0rc2
tabulate                    0.9.0
tbb                         2021.12.0
tiktoken                    0.7.0
tokenizers                  0.15.2
torch                       2.1.0a0+cxx11.abi
torchaudio                  2.1.0.post2+cxx11.abi
torchvision                 0.16.0a0+cxx11.abi
tqdm                        4.66.4
transformers                4.36.2
typing_extensions           4.12.2
urllib3                     2.2.1
wheel                       0.43.0

The traceback is:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "F:\LLM_Local\cmdDemo.py", line 101, in <module>

    ^
  File "F:\LLM_Local\cmdDemo.py", line 87, in main

  File "C:\Users\%USR%\anaconda3\envs\llm_gpu\Lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "C:\Users\%USR%\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py", line 1007, in stream_chat
    inputs = inputs.to(self.device)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\%USR%\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\%USR%\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                    ^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)

This seems to happen when I load the model and then leave it idle for some time.

TriDefender commented 2 months ago

Minimum code:

import os

# Recommended iGPU settings; these must be set as environment variables
# before the XPU runtime initializes.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel

Res = './glm-4-9b-chat/'
tokenizer = AutoTokenizer.from_pretrained(Res, trust_remote_code=True, encode_special_tokens=True)

model = AutoModel.from_pretrained(Res, load_in_4bit=True, trust_remote_code=True, optimize_model=False)
model = model.to('xpu')

# Preheat (borrowed from the official example)
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    model.generate(input_ids, do_sample=False, max_new_tokens=32)

stop_stream = False  # flag used to interrupt streaming output

def main():
    global stop_stream
    past_key_values = None
    history = []
    query = input("\n用户:")
    print("\nChatGLM:", end="")
    current_length = 0
    for response, history, past_key_values in model.stream_chat(
            tokenizer, query, history=history, top_p=0.75, temperature=0.9,
            repetition_penalty=1.2, past_key_values=past_key_values,
            return_past_key_values=True):
        if stop_stream:
            stop_stream = False
            break
        print(response[current_length:], end="", flush=True)
        current_length = len(response)
    print("")

if __name__ == "__main__":
    main()

Yeah, I definitely have to improve my coding skills, but at least this works for now. The issue can be reproduced with a wait of less than 10 minutes.

lzivan commented 2 months ago

Hi, thank you for reporting this issue. I have successfully reproduced the problem and am currently working on a solution. I will update you as soon as we have resolved it.

lzivan commented 2 months ago

Hi, here's our suggested solution: try adding cpu_embedding=True to the AutoModel.from_pretrained call when loading the model (this keeps the embedding layer on the CPU, which reduces GPU memory usage):

model = AutoModel.from_pretrained(Res, load_in_4bit=True, cpu_embedding=True, trust_remote_code=True, optimize_model=False)

TriDefender commented 2 months ago

Hi @lzivan, thanks for your advice. It did solve most of the problems; I will do more extensive testing to see whether anything happens after a longer wait period.

TriDefender commented 2 months ago

It still crashed after a longer wait with the program left idle:


用户:睡不着怎么办?

ChatGLM:Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "F:\LLM_Local\cmdDemo.py", line 86, in <module>
    main()
    ^^^^^^
  File "F:\LLM_Local\cmdDemo.py", line 79, in main
    for response, history, past_key_values in model.stream_chat(tokenizer,query, history = history, top_p=0.75, temperature=0.9,repetition_penalty=1.2, past_key_values=past_key_values,return_past_key_values=True):
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py", line 1007, in stream_chat
    inputs = inputs.to(self.device)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                    ^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
qiuxin2012 commented 2 months ago

We ran your minimum example successfully yesterday (with the stop_stream-related code removed). Does our example work? https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4#example-2-stream-chat-using-stream_chat-api

RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) is caused by insufficient memory. You can open Task Manager to observe your memory usage. How much free memory do you have before you run your program, and what happens after you start it?

Another way to save memory is to use the FP16 model: change model = model.to('xpu') to model = model.half().to('xpu').
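A minimal sketch of how this change fits into the loading code from the earlier snippet (the path Res and the other arguments are simply taken from that snippet; cpu_embedding=True is the suggestion from above, not a requirement of the FP16 change):

from ipex_llm.transformers import AutoModel

Res = './glm-4-9b-chat/'
model = AutoModel.from_pretrained(Res, load_in_4bit=True, cpu_embedding=True,
                                  trust_remote_code=True, optimize_model=False)
# Cast to FP16 before moving to the XPU, as suggested above, to reduce memory usage
model = model.half().to('xpu')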

TriDefender commented 2 months ago

Hi @qiuxin2012, I have 32 GB of physical memory (16+16); in addition, Windows has configured 20 GB of virtual memory. After loading the model and doing the first generation, the program takes up approximately 9 GB of RAM. For this test session I hard-rebooted my computer; RAM usage was around 7-8 GB prior to loading the model, and it went to 15.7 GB in use with 15.4 GB still available. Here's how it went: I activated the environment listed above, then launched the Python script containing a CLI stream-chat interface. I left the program idle for approximately 10 minutes, came back, and got the error above. Nevertheless, everything works fine if I start inference directly after loading the model, so something must happen during this wait period that results in the error. Your example certainly works because it starts inference directly after loading the model.

qiuxin2012 commented 2 months ago

@TriDefender Got it, we will try to reproduce your error.

bibekyess commented 2 months ago

Hello @qiuxin2012! I am running IPEX-LLM-built Ollama on Windows 11 and am facing a similar error with Intel(R) Iris(R) Xe Graphics. I have 8 GB VRAM and 16 GB RAM. When I run gemma:2b-instruct-v1.1-q8_0 and chatfire/bge-m3 (567M) for my RAG application, it works fine for the first 2-3 calls and then gives RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE). The model gets unloaded from RAM by itself after this error. It looks like 16 GB RAM is not enough for running these two models, but interestingly, if I use Ollama without IPEX-LLM (the CPU-only version from the official Ollama repository), the memory error does not occur, so I am wondering whether IPEX-LLM-built Ollama uses more memory than CPU-only Ollama. The IPEX-LLM Ollama log says that all the layers are offloaded to the GPU, and GPU memory is indeed being used, but I am very surprised at why the RAM usage is so high; it looks like the same model is being loaded twice, into both RAM and VRAM, when using IPEX-LLM. If you have any comments or suggestions on why the memory usage is so high, it would be nice to hear. Thank you!

TriDefender commented 2 months ago

Hi @bibekyess, if you are concerned about a VRAM issue, you can try adding cpu_embedding=True when loading the model. The model in RAM should be unloaded once it is moved to the XPU, but yes, it is normal for the model to be loaded into RAM first and then into VRAM. If you run into memory-size issues, you can add physical memory or configure a larger swap area.

TriDefender commented 2 months ago

Memory usage will increase after a few inferences because of all the cached tokens; try clearing the history and running again (a minimal sketch is below). The same situation happened here too. Perhaps try waiting a few minutes before your inference; it might be the same issue that we are both facing.
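A minimal sketch of resetting the conversation state between turns, assuming the stream_chat loop from the earlier snippet (the torch.xpu.empty_cache() call comes from intel_extension_for_pytorch and is optional):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the torch.xpu backend)

# Drop the accumulated conversation so cached tokens stop growing
history = []
past_key_values = None

# Optionally ask the XPU allocator to release cached device memory
if hasattr(torch, "xpu") and torch.xpu.is_available():
    torch.xpu.empty_cache()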

qiuxin2012 commented 2 months ago

> Hello @qiuxin2012! I am running IPEX-LLM-built Ollama on Windows 11 and am facing a similar error with Intel(R) Iris(R) Xe Graphics. I have 8 GB VRAM and 16 GB RAM. When I run gemma:2b-instruct-v1.1-q8_0 and chatfire/bge-m3 (567M) for my RAG application, it works fine for the first 2-3 calls and then gives RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE). The model gets unloaded from RAM by itself after this error. It looks like 16 GB RAM is not enough for running these two models, but interestingly, if I use Ollama without IPEX-LLM (the CPU-only version from the official Ollama repository), the memory error does not occur, so I am wondering whether IPEX-LLM-built Ollama uses more memory than CPU-only Ollama. The IPEX-LLM Ollama log says that all the layers are offloaded to the GPU, and GPU memory is indeed being used, but I am very surprised at why the RAM usage is so high; it looks like the same model is being loaded twice, into both RAM and VRAM, when using IPEX-LLM. If you have any comments or suggestions on why the memory usage is so high, it would be nice to hear. Thank you!

The Iris iGPU can use only about half of your 16 GB RAM, roughly 7.8 GB. If you use more than 7.8 GB of that shared memory, you will get Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE). You can use Task Manager to see the iGPU memory consumption.

[screenshot: Task Manager showing the iGPU's shared memory usage]

But CPU Ollama can use all of your memory. You could try increasing your RAM.
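For a quick programmatic check of what the XPU device reports, a sketch using the torch.xpu API exposed by intel_extension_for_pytorch (the fields printed for the device properties may vary between IPEX versions):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the torch.xpu backend)

# Print what the driver reports for the first XPU device, e.g. its name and memory size
if torch.xpu.is_available():
    print(torch.xpu.get_device_properties(0))
else:
    print("No XPU device available")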

bibekyess commented 2 months ago

@TriDefender Thank you for your response. Yeah, looks like we are in the same boat. :smile:

  1. In my case, I think the model first loads in RAM and then in VRAM, but the model in RAM doesn't get unloaded.
  2. How can I clear the cached tokens/history?

Actually, my waiting time, i.e. the time to first token, is not that long (not minutes as you suggested); it's somewhere between 10 and 20 seconds.

bibekyess commented 2 months ago

@qiuxin2012 Thank you for your response. It makes sense. I thought that Iris has a separate 8GB memory. So I was thinking that I had 16+8 GB memory in total. But based on your explanation, Iris uses half of the RAM memory so basically everything is loaded on that 16GB RAM only. Is my understanding correct?

qiuxin2012 commented 2 months ago

> @qiuxin2012 Thank you for your response. It makes sense. I thought that Iris has a separate 8GB memory. So I was thinking that I had 16+8 GB memory in total. But based on your explanation, Iris uses half of the RAM memory so basically everything is loaded on that 16GB RAM only. Is my understanding correct?

Yes, you are right. The iGPU memory is also shared with other programs. As you can see from my Task Manager screenshot, my laptop is connected to a 4K screen, so the usable memory is only 5.4 GB.

qiuxin2012 commented 2 months ago

> Hi @qiuxin2012, I have 32 GB of physical memory (16+16); in addition, Windows has configured 20 GB of virtual memory. After loading the model and doing the first generation, the program takes up approximately 9 GB of RAM. For this test session I hard-rebooted my computer; RAM usage was around 7-8 GB prior to loading the model, and it went to 15.7 GB in use with 15.4 GB still available. Here's how it went: I activated the environment listed above, then launched the Python script containing a CLI stream-chat interface. I left the program idle for approximately 10 minutes, came back, and got the error above. Nevertheless, everything works fine if I start inference directly after loading the model, so something must happen during this wait period that results in the error. Your example certainly works because it starts inference directly after loading the model.

@TriDefender We have reproduced your error after leaving the program idle for 5 minutes. I thought it was caused by garbage collection. I found a workaround: remove the warm-up block below from your code, and the error no longer occurs.

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    model.generate(input_ids, do_sample=False, max_new_tokens=32)

TriDefender commented 2 months ago

Hi @qiuxin2012, I tried your workaround, but it was not helpful. Here are the logs:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4'
2024-07-04 20:30:08,569 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 10/10 [00:02<00:00,  4.51it/s]
2024-07-04 20:30:11,686 - INFO - Converting the current model to sym_int4 format......
Sending model to GPU, please wait...
Successfully loaded Tokenizer and optimized Model!
欢迎使用 ChatGLM3-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序

用户:Who are you

This is the part of the output that the warm-up would normally have produced:

ChatGLM:C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
 (Triggered internally at C:/Users/arc/ruijie/2.1_RC3/python311/frameworks.ai.pytorch.ipex-gpu/csrc/gpu/jit/fusion_pass.cpp:837.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
I am an AI assistant named ChatGLM, which is developed based on the language model jointly trained by Tsinghua University KEG Lab and Zhipu.AI Company in 2024. My job is to provide appropriate answers and support for users' questions and requests.

Then I waited five minutes before continuing, and I got the same error:

用户:What can you do?

ChatGLM:Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "F:\LLM_Local\cmdDemo.py", line 73, in <module>
    main()
    ^^^^^^
  File "F:\LLM_Local\cmdDemo.py", line 66, in main
    for response, history, past_key_values in model.stream_chat(tokenizer,query, history = history, top_p=0.75, temperature=0.9,repetition_penalty=1.2, past_key_values=past_key_values,return_past_key_values=True):
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py", line 1007, in stream_chat
    inputs = inputs.to(self.device)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                    ^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)

It seems that after the first inference the program only stays stable for a short period (around 5 minutes); after that, no matter what you do, RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) is thrown. Whether or not the program is idle during this period does not affect the appearance of the error.

TriDefender commented 2 months ago

This shouldn't be an OOM issue, because I do have enough physical RAM and swap file.

bibekyess commented 2 months ago

Hi @qiuxin2012! I noticed one interesting observation. I face the above memory issue when running gemma:2b-instruct-v1.1-q8_0, but when I run a similar smaller variant (slightly different quantization and around 600 MB smaller file size) there is no issue: gemma:2b-instruct-v1.1-q6_K together with chatfire/bge-m3 (567M) runs smoothly, using around 4 GB of memory. So I am not sure whether it is an issue with my device or a small bug in ipex-llm; it may be worth checking by switching between these two models. Thanks! :)

qiuxin2012 commented 2 months ago

> Hi @qiuxin2012! I noticed one interesting observation. I face the above memory issue when running gemma:2b-instruct-v1.1-q8_0, but when I run a similar smaller variant (slightly different quantization and around 600 MB smaller file size) there is no issue: gemma:2b-instruct-v1.1-q6_K together with chatfire/bge-m3 (567M) runs smoothly, using around 4 GB of memory. So I am not sure whether it is an issue with my device or a small bug in ipex-llm; it may be worth checking by switching between these two models. Thanks! :)

Can you open a new issue for this? I will find another colleague to follow your issue.

qiuxin2012 commented 2 months ago

> This shouldn't be an OOM issue, because I do have enough physical RAM and swap file.

Yes, it won't be an OOM issue. I think it is caused by garbage collection, but I don't know why. I just checked my code: I'm using a saved sym_int4 model, which is a little different from yours, but I got this error again after I left the program idle for 5 minutes before asking the second question.

用户: What's AI?
ChatGLM: (normal output)
...5 minutes later...
用户: What can you do?
ChatGLM: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)

TriDefender commented 2 months ago

> This shouldn't be an OOM issue, because I do have enough physical RAM and swap file.
>
> Yes, it won't be an OOM issue. I think it is caused by garbage collection, but I don't know why. I just checked my code: I'm using a saved sym_int4 model, which is a little different from yours, but I got this error again after I left the program idle for 5 minutes before asking the second question.
>
> 用户: What's AI?
> ChatGLM: (normal output)
> ...5 minutes later...
> 用户: What can you do?
> ChatGLM: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)

Is there any way to resolve this issue?

qiuxin2012 commented 2 months ago

> This shouldn't be an OOM issue, because I do have enough physical RAM and swap file.
>
> Yes, it won't be an OOM issue. I think it is caused by garbage collection, but I don't know why. I just checked my code: I'm using a saved sym_int4 model, which is a little different from yours, but I got this error again after I left the program idle for 5 minutes before asking the second question.
>
> 用户: What's AI?
> ChatGLM: (normal output)
> ...5 minutes later...
> 用户: What can you do?
> ChatGLM: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
>
> Is there any way to resolve this issue?

I have no idea for now; I will let you know if I find a solution.

lgdcky commented 1 month ago

I have the same problem. Is there any way to resolve this issue now?

qiuxin2012 commented 1 month ago

> I have the same problem. Is there any way to resolve this issue now?

Not resolved.