intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Qwen-7B-Chat on Xeon+ARC770 #11219

Closed jianweimama closed 5 months ago

jianweimama commented 5 months ago

Qwen-7B-Chat with precision INT4_SYM and input/output tokens 1024/128 can run on ARC with the numbers shown below (screenshot attached).

But Qwen-7B-Chat with precision FP16 and input/output tokens 1024/128 cannot run due to out-of-memory errors. As a comparison, Llama2-7B-Chat with precision FP16 and input/output tokens 1024/128 can run. Is this expected? What causes the difference in memory usage between these two models?

--------------------log of Qwen-7B-Chat Precision (FP16) + Input/output token (1024/128)-------------------------------------------

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/benchmark/all-in-one/run.py", line 55, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1060, in forward
    lm_logits = self.lm_head(hidden_states)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 822, in forward
    result = torch.ops.torch_ipex.matmul_bias_out(x, self.weight, self.bias)
  File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Allocation is out of device memory on current platform.
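(For context on where the FP16 run fails: the traceback ends in the lm_head projection, whose FP16 weight scales with the vocabulary size. Below is a rough back-of-envelope sketch of that one layer's footprint; the hidden and vocabulary sizes are assumptions taken from the public Qwen-7B and Llama2-7B model configs, not from this log.)

```python
# Back-of-envelope estimate of the FP16 lm_head weight size
# (hidden_size * vocab_size * 2 bytes). Assumed sizes from the public
# model configs, not from the log above:
#   Qwen-7B:   hidden_size=4096, vocab_size~=151936
#   Llama2-7B: hidden_size=4096, vocab_size=32000
def lm_head_fp16_gib(hidden_size: int, vocab_size: int) -> float:
    return hidden_size * vocab_size * 2 / 1024**3

print(f"Qwen-7B lm_head (FP16):   ~{lm_head_fp16_gib(4096, 151936):.2f} GiB")  # ~1.16 GiB
print(f"Llama2-7B lm_head (FP16): ~{lm_head_fp16_gib(4096, 32000):.2f} GiB")   # ~0.24 GiB
```

The untied input embedding table is roughly the same size again, so the vocabulary difference alone plausibly accounts for on the order of 2 GiB of extra FP16 weights for Qwen-7B.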

lalalapotter commented 5 months ago

Reproduced the issue in our environment; this is expected. One reason Qwen-7B uses more memory than Llama2-7B is its much larger vocabulary size. Therefore, if you want to run FP16 precision, you could set cpu_embedding: True in config.yaml when using the transformer_int4_fp16_gpu API. There may be other factors contributing to the memory usage difference; you can also try low-memory mode by setting export IPEX_LLM_LOW_MEM=1 to save more memory.
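(For anyone running outside the all-in-one benchmark, the same workaround can be applied through the ipex-llm transformers-style API. The sketch below is a minimal example under that assumption; the model path, prompt, and generation length are placeholders, not part of the benchmark config.)

```python
# Minimal sketch (not the all-in-one benchmark itself): load Qwen-7B-Chat in FP16
# while keeping the large embedding table in host memory, mirroring the
# cpu_embedding: True setting suggested above.
import os
os.environ["IPEX_LLM_LOW_MEM"] = "1"   # optional low-memory mode mentioned above;
                                       # can also be exported in the shell instead

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen-7B-Chat"       # placeholder; point at your local checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",            # FP16 weights, matching the failing benchmark run
    cpu_embedding=True,                # keep the vocabulary embedding on the CPU
    trust_remote_code=True,
    use_cache=True,
).to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```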

jianweimama commented 5 months ago

with "cpu_embedding: True", Qwen-7B+fp16+1024/128 can work now.

thanks for help!