intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

RuntimeError: could not create a primitive #11305

Open Liqiandi opened 2 months ago

Liqiandi commented 2 months ago

Hi, I have successfully used the code below to test token-generation speed with Qwen-7B under ipex-llm:

```python
import time

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the torch.xpu backend
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

from ipex_llm import optimize_model


def main(model_dir="Qwen/Qwen-7B-Chat"):
    seed = 1024
    max_experiment_times = 1
    context_length_per_experiment = 1
    generate_length_per_experiment = 2048

    use_flash_attn = False
    set_seed(seed)
    print(torch.xpu.current_device())

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
    model = optimize_model(model, low_bit='sym_int8').to('xpu:1')

    print("Model parameters and configuration loaded\n")

    time_costs = []
    # The prompt is the Chinese character '我' repeated context_length_per_experiment times.
    context_str = '我' * context_length_per_experiment
    max_gpu_memory_cost = 0
    for _ in tqdm(range(max_experiment_times)):
        with torch.no_grad():
            input_ids = tokenizer.encode(context_str, return_tensors="pt").to('xpu:1')
        print("Tokenization finished\n")
        t1 = time.time()

        pred = model.generate(input_ids,
                              min_length=generate_length_per_experiment + context_length_per_experiment,
                              max_new_tokens=generate_length_per_experiment)
        output_str = tokenizer.decode(pred[0], skip_special_tokens=True)
        print(output_str)
        time_costs.append(time.time() - t1)
        # Use the XPU memory APIs: torch.cuda.* would query a (non-existent) CUDA device.
        max_gpu_memory_cost = max(max_gpu_memory_cost, torch.xpu.max_memory_allocated())
        torch.xpu.empty_cache()

    print(model_dir)
    print(f"time_costs = {sum(time_costs)}")
    # print("Average generate speed (tokens/s): {}".format((max_experiment_times * generate_length_per_experiment) / sum(time_costs)))
    print("Average generate speed (tokens/s): {}".format((max_experiment_times * pred.shape[1]) / sum(time_costs)))
    print(f"GPU Memory cost: {max_gpu_memory_cost / 1024 / 1024 / 1024}GB")
    print("Experiment setting: ")
    print(f"seed = {seed}")
    print(f"max_experiment_times = {max_experiment_times}")
    print(f"context_length_per_experiment = {context_length_per_experiment}")
    print(f"generate_length_per_experiment = {generate_length_per_experiment}")
    print(f"use_flash_attn = {use_flash_attn}")
    # print(f"quant_type = {quant_type}")
    print("\n")
```

But when I use your official benchmark code (run.py from https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one) to test, the following error occurs. What is the reason?

```
Loading checkpoint shards: 100%|██████████| 8/8 [00:04<00:00, 1.82it/s]
2024-06-13 15:30:06,365 - INFO - Converting the current model to sym_int8 format......
current_device: 1
loading of model costs 191.28394999999728s and 8.28125GB
<class 'transformers_modules.qwen_7b_chat.modeling_qwen.QWenLMHeadModel'>
C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\transformers\generation\configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\transformers\generation\configuration_utils.py:537: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
model generate cost: 4.547786500002985
Once upon a time, there was a young girl named Samantha who lived with her parents in a small town. Samantha had always dreamed of traveling the world and experiencing new cultures, but her parents were hesitant to let her go on such an adventure.

One day, Samantha stumbled upon a mysterious map that led to a hidden treasure
model generate cost: 1.107833699999901
[same output as above]
model generate cost: 1.343699099998048
[same output as above]
model generate cost: 1.3271714000002248
[same output as above]
onednn_verbose,info,oneDNN v3.3.0 (commit 887fb044ccd6308ed1780a3863c2c6f5772c94b3)
onednn_verbose,info,cpu,runtime:threadpool,nthr:12
onednn_verbose,info,cpu,isa:Intel AVX2 with Intel DL Boost
onednn_verbose,info,gpu,runtime:DPC++
onednn_verbose,info,gpu,engine,0,backend:Level Zero,name:Intel(R) UHD Graphics 770,driver_version:1.3.29283,binary_kernels:enabled
onednn_verbose,info,gpu,engine,1,backend:Level Zero,name:Intel(R) Arc(TM) A770 Graphics,driver_version:1.3.29283,binary_kernels:enabled
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,common,error,level_zero,errcode 1879048196
Traceback (most recent call last):
  File "D:\LQD\posefitness\posefitness\test\benchmark\run.py", line 1063, in run_transformer_int4_fp16_gpu_win
    output_ids = model.generate(input_ids, do_sample=False,
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\utils\benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\utils\benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\utils\benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\qwen_7b_chat\modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\transformers\models\qwen.py", line 496, in qwen_model_forward
    outputs = block(
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\qwen_7b_chat\modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\transformers\models\qwen.py", line 74, in qwen_attention_forward
    qkv = self.c_attn(hidden_states)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ProgramData\miniconda3\envs\posefit\Lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 734, in forward
    result = xe_linear.forward_new(x_2d, self.weight.data, self.weight.qtype,
RuntimeError: could not create a primitive
```

Liqiandi commented 2 months ago

It seems that inference succeeded several times, but the error occurred on a subsequent run?

violet17 commented 2 months ago

Which device do you use? An Arc discrete GPU on Windows? If you are using an Arc discrete GPU on Windows, please disable the integrated GPU.
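
For reference, one way to hide the iGPU from the runtime without disabling it in Device Manager is the oneAPI device-selector environment variable. A minimal sketch, assuming the Level Zero index of the Arc A770 is 1 (as in the `onednn_verbose` engine listing above; verify with `sycl-ls`):

```python
import os

# Assumption: restricting the Level Zero runtime to device 1 (the Arc A770
# in the onednn_verbose listing) hides the UHD 770 iGPU. This must be set
# before torch / intel_extension_for_pytorch are imported.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:1"

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

print(torch.xpu.device_count())      # expected to report 1 now
print(torch.xpu.get_device_name(0))  # the remaining device becomes xpu:0
```

Once only one device is visible, it becomes `xpu:0`, so scripts that hard-code a device index should be updated accordingly.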

Liqiandi commented 2 months ago

I have an Arc A770 and a UHD 770. After disabling the UHD 770, the issue is resolved. Thanks!