ChenVkl opened this issue 7 months ago (status: Open)
Hi,
Thanks for raising this issue. We want to confirm a few things with you:
- What is the initial GPU memory occupied by the system before you run model inference?
- What input length do you chat with the model? Is it the "context 730" shown in the error output?
We are converting the loaded model to fp16 for lower memory usage, as the Arc750 only has 8 GB of memory (follow-up in intel-analytics/text-generation-webui#25).
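For reference, a minimal sketch of the load-then-convert flow described above, assuming the current ipex-llm transformers-style API (the model path is a placeholder):

import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen-7B-Chat"  # placeholder; point this at your local model

# Quantize the weights to int4 at load time so the 7B model fits in 8 GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    trust_remote_code=True,
)
# Cast the remaining higher-precision parts to fp16 and move to the Arc GPU.
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)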
Regarding your question, the initial GPU memory occupied by the system is approximately 1.6 GB. When I chat with the model, any simple question results in an error indicating that memory is exceeded. I have now switched to the Qwen-7B int4 model to see whether it can run on BigDL. For the specific issue, please refer to this link: https://github.com/intel-analytics/ipex-llm/issues/10616
Some suggestions from our side for you to possibly run Qwen-7B on Arc750:
- Use the latest ipex-llm (we have renamed from bigdl-llm to ipex-llm) and run
export IPEX_LLM_LOW_MEM=1
before you launch the WebUI.
- Could you close some applications that are occupying GPU memory? If 1.6 GB is already taken before our workload runs, the remaining 6.4 GB may be challenging for Qwen-7B, I suppose.
OK, I see the latest link; I'll give it a try. Thanks a lot.
I'd like to ask whether you have run Qwen on a 750 before, and how much GPU memory it takes. Thanks.
I haven't tried the text-generation-webui, but for simple generation, Qwen-7B can run on Arc750; for a 256-token input the memory I observe is 5290.11 MB. This value comes from xpu-smi and may not be the actual peak memory; I suppose the peak would be close to or larger than 6 GB.
Some suggestions you may try to run on your side:
- https://github.com/intel-analytics/text-generation-webui Please use the latest code, as we have converted the activation precision to fp16.
export IPEX_LLM_LOW_MEM=1
before you launch your webui.
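As a side note on the xpu-smi caveat above, here is a hedged sketch for reading the peak from inside the script itself, assuming intel_extension_for_pytorch exposes the CUDA-style memory counters under torch.xpu (as recent XPU releases do):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

torch.xpu.reset_peak_memory_stats()
# ... run the model load and generate(...) calls here ...
peak_mb = torch.xpu.max_memory_allocated() / (1024 ** 2)
print(f"peak XPU memory allocated: {peak_mb:.1f} MB")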
Thank you very much, I'll give it a try. I also wanted to ask: you said you can run Qwen-7B on an A750; which link do you use? Could you please send it to me if it's convenient?
I'm using https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one with the api transformer_int4_fp16_gpu set in config.yaml, together with export IPEX_LLM_LOW_MEM=1, and then bash run-arc.sh. Is that the link you want?
Can you clarify where export IPEX_LLM_LOW_MEM=1 needs to be put? When I type it in the conda prompt before starting server.py, I get:
'export' is not recognized as an internal or external command, operable program or batch file.
My Arc A750 outputs the following error after a few interactions with the chatbot, which I assume is memory related.
RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
If you are running on Windows, please change it to
set IPEX_LLM_LOW_MEM=1
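A platform-independent alternative is to set the variable from Python; a sketch, assuming the flag is read when ipex-llm is first imported, so it must run before that import (e.g. at the very top of server.py):

import os

# Set the flag before ipex_llm (or any WebUI module that imports it) is loaded;
# this works the same on Windows and Linux, unlike export/set.
os.environ["IPEX_LLM_LOW_MEM"] = "1"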
Thanks, but that doesn't seem to have been the issue. As soon as roughly 2,000+ context is reached, I get
RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) Output generated in 1.29 seconds (0.00 tokens/s, 0 tokens, context 2308, seed 1309198421)
Should I open a new ticket?
Sure, you can open a new ticket and give more details about your settings (system, version, how you run, etc.). We will try to reproduce this. Thanks!
When I use the A750 to run BigDL and load the Qwen-7B int4 model, it reports that memory is exceeded. I don't know what's going on; is there a problem with my operation? The following is the error message:

Traceback (most recent call last):
  File "D:\workspace\text-generation-webui-bigdl-llm\modules\text_generation.py", line 408, in generate_reply_HF
    shared.model.generate(**generate_params)
  File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 1525, in generate
    return self.sample(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 2622, in sample
    outputs = self(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\bigdl\llm\transformers\low_bit_linear.py", line 622, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Allocation is out of device memory on current platform.
Output generated in 12.06 seconds (0.00 tokens/s, 0 tokens, context 730, seed 290229866)
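A back-of-envelope note on why the failure lands in lm_head: during the prefill step, logits are computed for every prompt position over the full vocabulary, and assuming Qwen-7B's roughly 152K-entry vocabulary (an assumption; the exact size may differ), that tensor alone is sizable on top of the weights and KV cache:

# Hedged back-of-envelope for the lm_head allocation in the traceback above.
context_len = 730        # from the error output: "context 730"
vocab_size = 151_936     # approximate Qwen-7B vocabulary size (assumption)
bytes_per_elem = 2       # fp16 logits
logits_bytes = context_len * vocab_size * bytes_per_elem
print(f"prefill logits tensor alone: {logits_bytes / 1024**2:.0f} MB")  # ~212 MB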