intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

BigDL-A750-Qwen7b-Allocation is out of device memory on current platform. #10575

Open ChenVkl opened 7 months ago

ChenVkl commented 7 months ago

When I use an A750 to run BigDL and load the Qwen-7B int4 model, it reports that device memory is exceeded. I don't know what's going on; is there a problem with my setup? The error message is below:

Traceback (most recent call last):
  File "D:\workspace\text-generation-webui-bigdl-llm\modules\text_generation.py", line 408, in generate_reply_HF
    shared.model.generate(**generate_params)
  File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 1525, in generate
    return self.sample(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\transformers\generation\utils.py", line 2622, in sample
    outputs = self(
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZhangChen\.cache\huggingface\modules\transformers_modules\Qwen-7B\modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ZhangChen\.conda\envs\llm\lib\site-packages\bigdl\llm\transformers\low_bit_linear.py", line 622, in forward
    result = linear_q4_0.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: Allocation is out of device memory on current platform.

Output generated in 12.06 seconds (0.00 tokens/s, 0 tokens, context 730, seed 290229866)

hkvision commented 7 months ago

Hi,

Thanks for raising this issue; we want to confirm a few things with you:

  • What is the initial GPU memory occupied by the system before you run model inference?
  • What input length do you chat with the model at? Is it the "context 730" shown in the log?

We are converting the loaded model into fp16 for lower memory usage, since the Arc A750 only has 8 GB of memory. (Follow-up in this issue: https://github.com/intel-analytics/text-generation-webui/issues/25)
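
For reference, a minimal sketch of the low-bit-plus-fp16 loading pattern being described, using ipex-llm's transformers-style API; the model path, prompt, and generation arguments below are illustrative assumptions, not the maintainers' exact setup:

# Sketch: load Qwen-7B with int4 weights, cast the remaining fp32 parts
# (embeddings, lm_head, norms) to fp16, and run it on an Intel GPU (XPU).
# Assumes ipex-llm's AutoModelForCausalLM; the path below is a placeholder.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "path/to/Qwen-7B-Chat"  # placeholder local checkpoint path

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,        # quantize linear weights to int4 at load time
    trust_remote_code=True,
    use_cache=True,
)
model = model.half().to("xpu")  # fp16 for the non-quantized tensors

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))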

ChenVkl commented 7 months ago

> • What is the initial GPU memory occupied by the system before you run model inference?
> • What input length do you chat with the model at? Is it the "context 730" shown in the log?

Regarding your questions: the initial GPU memory occupied by the system is approximately 1.6 GB. When I chat with the model, even a simple question results in an error saying that memory is exceeded. I have now switched to the Qwen-7b-int4 model to try whether it can run on BigDL; for that specific issue, please refer to this link: https://github.com/intel-analytics/ipex-llm/issues/10616

hkvision commented 7 months ago

Some suggestions from our side for possibly running Qwen-7B on the Arc A750:

  • Use the latest ipex-llm (we have renamed bigdl-llm to ipex-llm) and export IPEX_LLM_LOW_MEM=1 before launching the WebUI.
  • Could you close some applications that occupy GPU memory? If 1.6 GB is already taken before running our workload, the remaining 6.4 GB may be challenging for Qwen.

ChenVkl commented 7 months ago


> Use the latest ipex-llm (we have renamed bigdl-llm to ipex-llm) and export IPEX_LLM_LOW_MEM=1 before launching the WebUI.

Ok, I see the latest link, I'll give it a try. Thanks a lot.

ChenVkl commented 7 months ago

I'd like to ask if you have run Qwen with a 750 before, and how much GPU memory will it take? Thanks.

hkvision commented 7 months ago

> I'd like to ask if you have run Qwen with a 750 before, and how much GPU memory will it take? Thanks.

I haven't tried the text-generation-webui, but for simple generation Qwen-7B can run on the Arc A750; for a 256-token input the memory I observe is 5290.11 MB. This value is from xpu-smi and may not be the actual peak memory; I suppose the peak would be close to or somewhat larger than 6 GB.

Some suggestions you may try to run on your side:

ChenVkl commented 7 months ago

> qwen-7b can run on Arc750

Thank you very much, I'll give it a try. By the way, you said that you can run Qwen-7B on an A750; which link do you use? Could you please send it to me if it's convenient for you?

hkvision commented 7 months ago

I'm using https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one with the api transformer_int4_fp16_gpu in config.yaml, together with export IPEX_LLM_LOW_MEM=1 and bash run-arc.sh. Is this the link you want?
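
As a rough cross-check of figures like the 5290.11 MB xpu-smi reading above, peak allocations made by the inference process itself can also be queried from Python. This is only a sketch and assumes an Intel Extension for PyTorch (XPU) build that exposes torch.xpu memory statistics mirroring the torch.cuda API; it does not capture memory used by the desktop or other applications:

# Sketch: report XPU memory allocated by this process around a generate() call.
# Assumes torch.xpu memory statistics are available in the installed IPEX build.
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

def report_xpu_memory(tag: str) -> None:
    # memory_allocated / max_memory_allocated mirror the torch.cuda API
    current_mib = torch.xpu.memory_allocated() / 1024**2
    peak_mib = torch.xpu.max_memory_allocated() / 1024**2
    print(f"[{tag}] current={current_mib:.1f} MiB, peak={peak_mib:.1f} MiB")

report_xpu_memory("before generate")
# ... run model.generate(**inputs, max_new_tokens=...) here ...
report_xpu_memory("after generate")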

Daroude commented 7 months ago

Can you clarify where export IPEX_LLM_LOW_MEM=1 needs to be put? When I type it in the conda prompt before starting server.py I get:

'export' is not recognized as an internal or external command, operable program or batch file.

My Arc A750 outputs the following error after a few interactions with the chatbot, which I assume is memory related.

RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)

hkvision commented 7 months ago

If you are running on Windows, please change it to set IPEX_LLM_LOW_MEM=1.
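
If dealing with shell-specific syntax is inconvenient, one alternative (a sketch, not something taken from the ipex-llm docs) is to set the variable from Python before ipex-llm and the model are loaded, which behaves the same on Windows and Linux; this assumes the flag is read when ipex-llm initializes:

# Sketch: set the low-memory flag from Python so export vs. set no longer
# matters. Must run before ipex_llm (and the model) are imported/loaded.
import os

os.environ["IPEX_LLM_LOW_MEM"] = "1"

# ... then import ipex_llm / launch the WebUI (e.g. python server.py) ...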

Daroude commented 7 months ago

Thanks, but that doesn't seem to have been the issue. As soon as about 2,000+ context is reached I get

RuntimeError: Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error) Output generated in 1.29 seconds (0.00 tokens/s, 0 tokens, context 2308, seed 1309198421)

should I open a new ticket?

hkvision commented 7 months ago

> should I open a new ticket?

Sure, you can open a new ticket and give more details about your settings (system, version, how you run, etc.). We will try to reproduce this. Thanks!