intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Qwen-7B-Chat doesn't support 2048-token input on A770 #9590

Open KiwiHana opened 10 months ago

KiwiHana commented 10 months ago

Test script: bigdl all-in-one/run-arc.sh, using model.half().to("xpu") instead of model.to("xpu"). Input prompt: a 2048-token .txt file; output: 1024 tokens. A minimal sketch of this load path is shown below.
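
For reference, a minimal sketch of the load path under test, assuming bigdl-llm's 4-bit transformers API of that release (the model path and prompt are placeholders, not taken from the benchmark config):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen-7B-Chat"  # placeholder: run.py loads from local_model_hub

# Load with bigdl-llm's 4-bit (sym_int4) weight quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model = model.half().to("xpu")  # variant under test; the default path is model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("2048-token prompt here", return_tensors="pt").input_ids.to("xpu")
with torch.inference_mode():
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=1024)
```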

/all-in-one$ ./run-arc.sh

:: initializing oneAPI environment ...
   run-arc.sh: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

Monday, December 4, 2023, 11:27:39 CST
T01 Cap mem
T08   32in  32out
/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2023-12-04 11:27:45,566 - WARNING - Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023. Please use the latest model and code; if you started using Qwen-7B before September 25, be careful not to use the wrong code or model.
Loading checkpoint shards: 100%|██████████████████| 8/8 [00:00<00:00, 16.78it/s]
2023-12-04 11:27:46,341 - INFO - Converting the current model to sym_int4 format......
>> loading of model costs 51.830846463s
<class 'transformers_modules.Qwen-7B-Chat.modeling_qwen.QWenLMHeadModel'>
input length is:  torch.Size([1, 32])
model generate cost: 6.790750387000003
actual_out_len 32
model.first_cost, model.rest_cost_mean 5.428072226000012 0.038240609322580284
input length is:  torch.Size([1, 32])
model generate cost: 1.250412750999999
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.13376997499997856 0.03599299412903087
input length is:  torch.Size([1, 32])
model generate cost: 1.2456304680000017
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.13341257499999415 0.03585413038709912
input length is:  torch.Size([1, 32])
model generate cost: 1.246626799000012
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.1331302570000048 0.0358962405483873
Traceback (most recent call last):
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 555, in <mod                                ule>
    run_model(model, api, conf['in_out_pairs'], conf['local_model_hub'], conf['w                                arm_up'], conf['num_trials'], conf['num_beams'])
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 50, in run_m                                odel
    result = run_transformer_int4_gpu(repo_id, local_model_hub, in_out_pairs, wa                                rm_up, num_trials, num_beams,batch_size)
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 379, in run_                                transformer_int4_gpu
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_l                                en,
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 1561, in generate
    return self.greedy_search(
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 2382, in greedy_search
    outputs = self(
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 531, in __call__
    return self.model(*args, **kwargs)
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 1104, in forward
    transformer_outputs = self.transformer(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 934, in forward
    outputs = block(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 635, in forward
    attn_outputs = self.attn(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/bigd                                l/llm/transformers/models/qwen.py", line 189, in qwen_attention_forward
    attn_output, attn_weight = self._attn(
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 342, in _attn
    attn_weights = torch.where(
RuntimeError: Allocation is out of device memory on current platform.
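
For context, the failing torch.where call in _attn materializes the full attention-score matrix. A rough back-of-the-envelope, assuming batch size 1, Qwen-7B's 32 heads, and fp16 scores (assumptions for illustration, not values taken from the log), shows the scale at a 2048-token prompt:

```python
# Rough size of one materialized [batch, heads, seq, seq] attention-score
# matrix; batch=1, 32 heads, and fp16 (2 bytes/element) are assumptions.
batch, heads, seq, bytes_per_el = 1, 32, 2048, 2
score_bytes = batch * heads * seq * seq * bytes_per_el
print(f"{score_bytes / 1024**2:.0f} MiB")  # -> 256 MiB per materialized score matrix
```

During prefill, several such temporaries (scores, the masked copy from torch.where, the softmax output) can coexist, which is why a 2048-token prompt can exhaust device memory where the 32-token runs above succeed.
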
qiuxin2012 commented 10 months ago

I can't reproduce your error with the latest 2.5.0b20231205 bigdl-core-xe and bigdl-llm. Could you share your OS, driver, and oneAPI versions?

My machine is Ubuntu 22.04.3 with the Linux 5.19.0-41-generic kernel; the driver version is https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html, and oneAPI is 2023.2.0. You can see our recommended requirements here: https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#recommended-requirements
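
A quick sketch for gathering the requested version info (sycl-ls ships with the oneAPI basekit; the imports assume the same llm-test environment used above):

```python
# Collect OS/kernel, oneAPI device, and library versions for a bug report.
import platform
import subprocess

print(platform.platform())  # OS and kernel, e.g. Linux-5.19.0-41-generic-...
print(subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout)  # SYCL/Level Zero devices

import torch
import intel_extension_for_pytorch as ipex
print("torch:", torch.__version__, "ipex:", ipex.__version__)
```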

qiuxin2012 commented 10 months ago

Please make sure your Qwen model is updated to the October 12th version.
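
One hedged way to refresh a local checkpoint from the hub (the repo id and target directory below are assumptions, not paths from this setup):

```python
# Hypothetical refresh of a local Qwen-7B-Chat checkpoint via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen-7B-Chat",                         # assumed hub id
    local_dir="/path/to/local_model_hub/Qwen-7B-Chat",   # assumed target path
)
```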

kevin-t-tang commented 10 months ago

Qwen-7B-Chat, after upgrading to bigdl-llm 2.5.0b20231205:

// Case 1: use model.half().to('xpu')

model.first_cost, model.rest_cost_mean 0.13400719300011588 0.030304098225813855
input length is: torch.Size([1, 1987])
model generate cost: 1.3971823479998875
actual_out_len 2

GPU memory cost: about 13440.93 MB

// Case 2: use model.to('xpu')

model.first_cost, model.rest_cost_mean 0.1322723490000044 0.03364436958065286
input length is: torch.Size([1, 1987])
model generate cost: 1.1173496979999982
actual_out_len 2

GPU memory cost: about 10567.20 MB
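
For anyone repeating this comparison, a minimal sketch of one way to read device memory between the two variants (it assumes IPEX's torch.xpu memory counters are available on this stack; the numbers above may come from a different tool):

```python
# Compare allocated XPU memory after generate() for the two load variants.
import torch
import intel_extension_for_pytorch as ipex  # provides the torch.xpu namespace

def report(tag: str) -> None:
    torch.xpu.synchronize()                          # wait for pending kernels
    mb = torch.xpu.memory_allocated() / 1024**2      # bytes -> MB
    print(f"{tag}: {mb:.2f} MB allocated")

# Case 1: model = model.half().to("xpu"); run generate(); then report("half")
# Case 2: model = model.to("xpu");        run generate(); then report("default")
```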