intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Qwen-7B-Chat doesn't support 2048-token input on A770 #9590

Open KiwiHana opened 10 months ago

KiwiHana commented 10 months ago

Test script: bigdl all-in-one/run-arc.sh, using model.half().to("xpu") instead of model.to("xpu"). Input prompt: a 2048-token .txt file; output: 1024 tokens. A minimal sketch of this load path is shown below.
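
For reference, a minimal sketch of the load path under test, assuming bigdl-llm's 4-bit transformers API of that release (the model path and prompt are placeholders, not taken from the benchmark config):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen-7B-Chat"  # placeholder: run.py loads from local_model_hub

# Load with bigdl-llm's 4-bit (sym_int4) weight quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model = model.half().to("xpu")  # variant under test; the default path is model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("2048-token prompt here", return_tensors="pt").input_ids.to("xpu")
with torch.inference_mode():
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=1024)
```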

/all-in-one$ ./run-arc.sh

:: initializing oneAPI environment ...
   run-arc.sh: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

Monday, December 4, 2023, 11:27:39 CST
T01 Cap mem
T08   32in  32out
/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2023-12-04 11:27:45,566 - WARNING - Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023. Please use the latest model and code; if you started using Qwen-7B before September 25, be careful not to use the wrong code or model.
Loading checkpoint shards: 100%|██████████████████| 8/8 [00:00<00:00, 16.78it/s]
2023-12-04 11:27:46,341 - INFO - Converting the current model to sym_int4 format......
>> loading of model costs 51.830846463s
<class 'transformers_modules.Qwen-7B-Chat.modeling_qwen.QWenLMHeadModel'>
input length is:  torch.Size([1, 32])
model generate cost: 6.790750387000003
actual_out_len 32
model.first_cost, model.rest_cost_mean 5.428072226000012 0.038240609322580284
input length is:  torch.Size([1, 32])
model generate cost: 1.250412750999999
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.13376997499997856 0.03599299412903087
input length is:  torch.Size([1, 32])
model generate cost: 1.2456304680000017
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.13341257499999415 0.03585413038709912
input length is:  torch.Size([1, 32])
model generate cost: 1.246626799000012
actual_out_len 32
model.first_cost, model.rest_cost_mean 0.1331302570000048 0.0358962405483873
Traceback (most recent call last):
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 555, in <mod                                ule>
    run_model(model, api, conf['in_out_pairs'], conf['local_model_hub'], conf['w                                arm_up'], conf['num_trials'], conf['num_beams'])
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 50, in run_m                                odel
    result = run_transformer_int4_gpu(repo_id, local_model_hub, in_out_pairs, wa                                rm_up, num_trials, num_beams,batch_size)
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/run.py", line 379, in run_                                transformer_int4_gpu
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_l                                en,
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 1561, in generate
    return self.greedy_search(
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 2382, in greedy_search
    outputs = self(
  File "/home/adc-a770/llm/bigdl/benchmark/all-in-one/../benchmark_util.py", lin                                e 531, in __call__
    return self.model(*args, **kwargs)
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 1104, in forward
    transformer_outputs = self.transformer(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 934, in forward
    outputs = block(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 635, in forward
    attn_outputs = self.attn(
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torc                                h/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc-a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/bigd                                l/llm/transformers/models/qwen.py", line 189, in qwen_attention_forward
    attn_output, attn_weight = self._attn(
  File "/home/adc-a770/.cache/huggingface/modules/transformers_modules/Qwen-7B-C                                hat/modeling_qwen.py", line 342, in _attn
    attn_weights = torch.where(
RuntimeError: Allocation is out of device memory on current platform.
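
For context, the failing torch.where call in _attn materializes the full attention-score matrix. A rough back-of-the-envelope, assuming batch size 1, Qwen-7B's 32 heads, and fp16 scores (assumptions for illustration, not values taken from the log), shows the scale at a 2048-token prompt:

```python
# Rough size of one materialized [batch, heads, seq, seq] attention-score
# matrix; batch=1, 32 heads, and fp16 (2 bytes/element) are assumptions.
batch, heads, seq, bytes_per_el = 1, 32, 2048, 2
score_bytes = batch * heads * seq * seq * bytes_per_el
print(f"{score_bytes / 1024**2:.0f} MiB")  # -> 256 MiB per materialized score matrix
```

During prefill, several such temporaries (scores, the masked copy from torch.where, the softmax output) can coexist, which is why a 2048-token prompt can exhaust device memory where the 32-token runs above succeed.
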
qiuxin2012 commented 10 months ago

I can't reproduce your error with the latest 2.5.0b20231205 bigdl-core-xe and bigdl-llm. Could you share your OS, driver, and oneAPI versions?

My machine is Ubuntu 22.04.3 with the Linux 5.19.0-41-generic kernel; the driver version is https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html, and oneAPI is 2023.2.0. You can see our recommended requirements here: https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#recommended-requirements
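
A quick sketch for gathering the requested version info (sycl-ls ships with the oneAPI basekit; the imports assume the same llm-test environment used above):

```python
# Collect OS/kernel, oneAPI device, and library versions for a bug report.
import platform
import subprocess

print(platform.platform())  # OS and kernel, e.g. Linux-5.19.0-41-generic-...
print(subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout)  # SYCL/Level Zero devices

import torch
import intel_extension_for_pytorch as ipex
print("torch:", torch.__version__, "ipex:", ipex.__version__)
```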

qiuxin2012 commented 10 months ago

Please make sure your Qwen model is updated to the October 12th version.
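
One hedged way to refresh a local checkpoint from the hub (the repo id and target directory below are assumptions, not paths from this setup):

```python
# Hypothetical refresh of a local Qwen-7B-Chat checkpoint via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen-7B-Chat",                         # assumed hub id
    local_dir="/path/to/local_model_hub/Qwen-7B-Chat",   # assumed target path
)
```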

kevin-t-tang commented 10 months ago

Qwen-7B-Chat, after upgrading to bigdl-llm 2.5.0b20231205:

// Case 1: use model.half().to('xpu')

model.first_cost, model.rest_cost_mean 0.13400719300011588 0.030304098225813855
input length is: torch.Size([1, 1987])
model generate cost: 1.3971823479998875
actual_out_len 2

GPU memory cost: about 13440.93 MB

// Case 2: use model.to('xpu')

model.first_cost, model.rest_cost_mean 0.1322723490000044 0.03364436958065286
input length is: torch.Size([1, 1987])
model generate cost: 1.1173496979999982
actual_out_len 2

GPU memory cost: about 10567.20 MB
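
For anyone repeating this comparison, a minimal sketch of one way to read device memory between the two variants (it assumes IPEX's torch.xpu memory counters are available on this stack; the numbers above may come from a different tool):

```python
# Compare allocated XPU memory after generate() for the two load variants.
import torch
import intel_extension_for_pytorch as ipex  # provides the torch.xpu namespace

def report(tag: str) -> None:
    torch.xpu.synchronize()                          # wait for pending kernels
    mb = torch.xpu.memory_allocated() / 1024**2      # bytes -> MB
    print(f"{tag}: {mb:.2f} MB allocated")

# Case 1: model = model.half().to("xpu"); run generate(); then report("half")
# Case 2: model = model.to("xpu");        run generate(); then report("default")
```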