Open TriDefender opened 2 months ago
Minimum code:
from transformers import AutoTokenizer
import torch
Res ='./glm-4-9b-chat/'
tokenizer = AutoTokenizer.from_pretrained(Res, trust_remote_code=True,encode_special_tokens=True)
from ipex_llm.transformers import AutoModel
model = AutoModel.from_pretrained(Res,load_in_4bit=True, trust_remote_code=True,optimize_model=False)
model=model.to('xpu')
#Preheat (borrowed from official example)
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
model.generate(input_ids,do_sample=False, max_new_tokens=32)
def main():
past_key_values = None
history = []
global stop_stream
query = input("\n用户:")
print("\nChatGLM:", end="")
current_length = 0
SYCL_CACHE_PERSISTENT=1
BIGDL_LLM_XMX_DISABLED=1
for response, history, past_key_values in model.stream_chat(tokenizer, query, history = history, top_p=0.75, temperature=0.9, repetition_penalty=1.2, past_key_values=past_key_values, return_past_key_values=True):
if stop_stream:
stop_stream = False
break
else:
print(response[current_length:], end="", flush=True)
current_length = len(response)
print("")
if __name__ == "__main__":
main()
Yeah I definitively have to improve coding skills but atleast these works, for now Reproduction is possible by a wait of less than 10 mins
Hi, Thank you for reporting this issue. I have successfully reproduced the problem and am currently working on a solution. I will update you as soon as we have resolved it.
Hi, Here's our suggested solution: Try adding "cpu_embedding=True" to the "model" at line 8.
model = AutoModel.from_pretrained(Res,load_in_4bit=True, cpu_embedding=True, trust_remote_code=True,optimize_model=False)
Hi @lzivan, Thanks for your advice, it did solve most of the problems, I will do more extensive testing to see if something will happen after a longer wait period.
Still crashed after a longer wait, left the program idle
用户:睡不着怎么办?
ChatGLM:Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "F:\LLM_Local\cmdDemo.py", line 86, in <module>
main()
^^^^^^
File "F:\LLM_Local\cmdDemo.py", line 79, in main
for response, history, past_key_values in model.stream_chat(tokenizer,query, history = history, top_p=0.75, temperature=0.9,repetition_penalty=1.2, past_key_values=past_key_values,return_past_key_values=True):
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py", line 1007, in stream_chat
inputs = inputs.to(self.device)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in to
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
We have run your minimum example successfully yesterday. (delete the stop_stream
related code)
Does our example work? https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/glm4#example-2-stream-chat-using-stream_chat-api
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
is caused by memory not enough.
You can open your task manager to observe your memory usage. How many empty memory do you have before you run your program? And what happened after you start you program?
Another method you can try to save more memory is using fp16 model, change model=model.to('xpu')
to model=model.half().to('xpu')
Hi @qiuxin2012, I have 32 gigabytes of physical memory in format of 16+16, other than that, windows configured 20 gigabytes of virtual memory. After loading the model and doing the first generation, the program takes up approximately 9 gigabytes of ram. During this test session, I hard rebooted my computer, the ram in usage was around 7-8 gigabytes prior to loading the model, it went to 15.7 gb with 15.4gb available. Here's how it went: I activated the environment which is given above, then I launched the python script containing a cli_streamchat interface. I left the program idle for 10 minutes approximately, I came back and I get the error above. Nevertheless, everything works fine if I started inference directly after loading the model. Therefore the only reason is that something happened during this wait period that resulted in the error. Your example certainly works because it starts inference directly after loading the model.
@TriDefender Got it, we will try to reproduce your error.
Hello @qiuxin2012! I am running IPEXLLM-built ollama on Windows 11.
I am also facing a similar error with Intel(R) Iris(R) Xe Graphics
. I have 8GB VRAM and 16GB RAM. When I run gemma:2b-instruct-v1.1-q8_0
and chatfire/bge-m3
(567M) for my RAG application, it works fine for first 2-3 calls and then it gives RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
The model gets unloaded from RAM itself after this error. It looks like 16GB RAM is not enough for running these two models, but interestingly if I use Ollama without IPEXLLM, (cpu-only version from ollama official repository), the memory error is not faced, so I am wondering if it is like IPEXLLM built ollama uses more memory than cpu-only ollama?
The IPEXLLM-ollama log says that all the layers are offloaded to GPU, and indeed GPU memory is also being used, but I am very surprised on why is the RAM usage so high? Looks like the same model is being loaded twice on both RAM and VRAM when using IPEXLLM.
If you have any comments/suggestions on why is memory usage so high, it would be nice to hear. Thank you!
Hi @bibekyess ,
If you are concerned about vram issue, you can try adding cpu_embedding=True
when loading the model, the model in RAM should be unloaded when moved to XPU, but yes, it is normal for the model to be first loaded in RAM then in VRAM, if you face certain issues concerning memory size you can add some physical memory or configure a larger swap area.
memory usage will increase after a few inferences because of all the token cached, try clearing the history and try again, the same situation happened here too, perhaps waiting a few minutes before your inference? It might be the same issue that we are both facing.
Hello @qiuxin2012! I am running IPEXLLM-built ollama on Windows 11. I am also facing a similar error with
Intel(R) Iris(R) Xe Graphics
. I have 8GB VRAM and 16GB RAM. When I rungemma:2b-instruct-v1.1-q8_0
andchatfire/bge-m3
(567M) for my RAG application, it works fine for first 2-3 calls and then it givesRuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
The model gets unloaded from RAM itself after this error. It looks like 16GB RAM is not enough for running these two models, but interestingly if I use Ollama without IPEXLLM, (cpu-only version from ollama official repository), the memory error is not faced, so I am wondering if it is like IPEXLLM built ollama uses more memory than cpu-only ollama? The IPEXLLM-ollama log says that all the layers are offloaded to GPU, and indeed GPU memory is also being used, but I am very surprised on why is the RAM usage so high? Looks like the same model is being loaded twice on both RAM and VRAM when using IPEXLLM. If you have any comments/suggestions on why is memory usage so high, it would be nice to hear. Thank you!
Iris can use only half of your 16GB RAM, about 7.8 GB. If you use more then 7.8GB memory, you will get Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
. You can use task manager to see the igpu memory consumption.
But cpu ollama can use all your memory. Maybe your can try to increase your RAM.
@TriDefender Thank you for your response. Yeah, looks like we are in the same boat. :smile:
- In my case, I think the model first loads in RAM and then in VRAM, but the model in RAM doesn't get unloaded.
- How can I clear the cached tokens/history?
Actually, my waiting time or time-to-first-token is not that long (in minutes as you said), its somewhere between 10 to 20 seconds.
@qiuxin2012 Thank you for your response. It makes sense. I thought that Iris has a separate 8GB memory. So I was thinking that I had 16+8 GB memory in total. But based on your explanation, Iris uses half of the RAM memory so basically everything is loaded on that 16GB RAM only. Is my understanding correct?
@qiuxin2012 Thank you for your response. It makes sense. I thought that Iris has a separate 8GB memory. So I was thinking that I had 16+8 GB memory in total. But based on your explanation, Iris uses half of the RAM memory so basically everything is loaded on that 16GB RAM only. Is my understanding correct?
Yes, you are right. The igpu memory is also shared with other programs. You can see from my task manager page, my laptop is connected to a 4k screen. The memory usable is only 5.4GB.
Hi @qiuxin2012, I have 32 gigabytes of physical memory in format of 16+16, other than that, windows configured 20 gigabytes of virtual memory. After loading the model and doing the first generation, the program takes up approximately 9 gigabytes of ram. During this test session, I hard rebooted my computer, the ram in usage was around 7-8 gigabytes prior to loading the model, it went to 15.7 gb with 15.4gb available. Here's how it went: I activated the environment which is given above, then I launched the python script containing a cli_streamchat interface. I left the program idle for 10 minutes approximately, I came back and I get the error above. Nevertheless, everything works fine if I started inference directly after loading the model. Therefore the only reason is that something happened during this wait period that resulted in the error. Your example certainly works because it starts inference directly after loading the model.
@TriDefender We have reproduced your error, after I left the program idle for 5 minutes. I thought it was caused by the garbage collections. I found a work around. I tried to remove the warmup in your code, then I won't get the error.
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
model.generate(input_ids,do_sample=False, max_new_tokens=32)
Hi,@qiuxin2012 , I tried your workaround, but it was not helpful, here are the logs:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4'
2024-07-04 20:30:08,569 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 10/10 [00:02<00:00, 4.51it/s]
2024-07-04 20:30:11,686 - INFO - Converting the current model to sym_int4 format......
Sending model to GPU, please wait...
Successfully loaded Tokenizer and optimized Model!
欢迎使用 ChatGLM3-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序
用户:Who are you
This is the part where the warmup would normaly produce:
ChatGLM:C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
(Triggered internally at C:/Users/arc/ruijie/2.1_RC3/python311/frameworks.ai.pytorch.ipex-gpu/csrc/gpu/jit/fusion_pass.cpp:837.)
query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
I am an AI assistant named ChatGLM, which is developed based on the language model jointly trained by Tsinghua University KEG Lab and Zhipu.AI Company in 2024. My job is to provide appropriate answers and support for users' questions and requests.
Then I waited five minutes before continuing, and i got the same error
用户:What can you do?
ChatGLM:Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "F:\LLM_Local\cmdDemo.py", line 73, in <module>
main()
^^^^^^
File "F:\LLM_Local\cmdDemo.py", line 66, in main
for response, history, past_key_values in model.stream_chat(tokenizer,query, history = history, top_p=0.75, temperature=0.9,repetition_penalty=1.2, past_key_values=past_key_values,return_past_key_values=True):
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "C:\Users\18913\.cache\huggingface\modules\transformers_modules\modeling_chatglm.py", line 1007, in stream_chat
inputs = inputs.to(self.device)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in to
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\18913\anaconda3\envs\llm_gpu\Lib\site-packages\transformers\tokenization_utils_base.py", line 789, in <dictcomp>
self.data = {k: v.to(device=device) for k, v in self.data.items()}
^^^^^^^^^^^^^^^^^^^
RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
It seems that after the first inference the program will only stay stable after a short moment(around 5 minutes), then no matter what you do the RuntimeError: Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
will be thrown. Being idle or not in this period will not affect the appearance of this error.
This shouldn't be a OOM issue, because I do have enough physical ran and swapfile.
Hi @qiuxin2012!
I noticed one interesting observation. I face the above memory issue when running gemma:2b-instruct-v1.1-q8_0
but when I run a similar small variant (slightly different quantization and around 600MB file-size difference) there is no issue. gemma:2b-instruct-v1.1-q6_K
together with chatfire/bge-m3
(567M) was running smoothly, using around 4GB memory.
So, I am not sure if it is my device issue or a small bug in ipex-llm. Maybe worth to check by switching these two models.
Thanks! :)
Hi @qiuxin2012! I noticed one interesting observation. I face the above memory issue when running
gemma:2b-instruct-v1.1-q8_0
but when I run a similar small variant (slightly different quantization and around 600MB file-size difference) there is no issue.gemma:2b-instruct-v1.1-q6_K
together withchatfire/bge-m3
(567M) was running smoothly, using around 4GB memory. So, I am not sure if it is my device issue or a small bug in ipex-llm. Maybe worth to check by switching these two models. Thanks! :)
Can you open a new issue for this? I will find another colleague to follow your issue.
This shouldn't be a OOM issue, because I do have enough physical ran and swapfile.
Yes, it won't be a OOM issue. I think is caused by garbage collection. But I don't know why. I just check my code. I'm using a saved Sym Int4 model, a little different from yours. But I got this error again, after I left the program idle for 5 minites before I ask the second question.
用户:What‘s AI?
Chatglm:balabala normal output
5 minutes later
用户:What can you do?
Chatglm:Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
This shouldn't be a OOM issue, because I do have enough physical ran and swapfile.
Yes, it won't be a OOM issue. I think is caused by garbage collection. But I don't know why. I just check my code. I'm using a saved Sym Int4 model, a little different from yours. But I got this error again, after I left the program idle for 5 minites before I ask the second question.
用户:What‘s AI? Chatglm:balabala normal output 5 minutes later 用户:What can you do? Chatglm:Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Is there anyway to resolve this issue?
This shouldn't be a OOM issue, because I do have enough physical ran and swapfile.
Yes, it won't be a OOM issue. I think is caused by garbage collection. But I don't know why. I just check my code. I'm using a saved Sym Int4 model, a little different from yours. But I got this error again, after I left the program idle for 5 minites before I ask the second question.
用户:What‘s AI? Chatglm:balabala normal output 5 minutes later 用户:What can you do? Chatglm:Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Is there anyway to resolve this issue?
I have no idea now, I will inform you if I find the solution.
i have the same problem,Is there any way to resolve this issue now?
i have the same problem,Is there any way to resolve this issue now?
Not resolved.
Traceback is as followed, I was running ChatGLM4-9b-chat on my laptop. Device configurations OS: Win 11 23H2 (22631.3737)
The traceback is:
This seems to happen when I loaded the model and left it there idle after sometime,