⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
After fine-tuning Qwen2-1.5B-Instruct and applying AWQ quantization, an error occurs when running inference on CPU with intel-extension-for-transformers. When I previously fine-tuned and quantized Qwen1.5-4B-Chat the same way, CPU inference with intel-extension-for-transformers worked fine. #1697
I don't know what the problem is; it looks like a parameter error in the converted model. Everything works fine when I run the same model with the transformers library on a GPU instead.
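For context, the failing CPU path follows the usual intel-extension-for-transformers pattern. This is only an illustrative sketch, not the exact contents of my `awq_intel_extension.py`; the model path and prompt are placeholders (in practice I point it at the fine-tuned, AWQ-quantized checkpoint):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Placeholder path; in my script this is the fine-tuned + AWQ-quantized Qwen2 checkpoint.
model_path = "Qwen/Qwen2-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# load_in_4bit routes weight-only quantized inference through the neural_speed
# backend on CPU, which is where the abort below happens during model load.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

The error output from this path is below.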
```
model.cpp: loading model from runtime_outs/ne_qwen2_q_autoround.bin
The number of ne_parameters is wrong.
init: n_vocab    = 151936
init: n_embd     = 1536
init: n_mult     = 8960
init: n_head     = 12
init: n_head_kv  = 0
init: n_layer    = 28
init: n_rot      = 128
init: ftype      = 0
init: max_seq_len= 32768
init: n_ff       = 8960
init: n_parts    = 1
MODEL_ASSERT: /root/w0/workspace/neuralspeed-wheel-build/nlp_repo/neural_speed/./models/qwen/qwen.h:48: false
/tmp/tmp9b4073w1: line 3: 55575 Aborted                 python /home/lmf/llm/Qwen2-finetuning/awq_intel_extension.py
ERROR conda.cli.main_run:execute(124): `conda run python /home/lmf/llm/Qwen2-finetuning/awq_intel_extension.py` failed. (See above for error)
```
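One detail that stands out in the log is `n_head_kv = 0`. My assumption (unconfirmed) is that the converter is dropping the grouped-query-attention head count for Qwen2, since the upstream Hugging Face config for Qwen2-1.5B-Instruct reports a nonzero `num_key_value_heads`. A quick way to compare the pre-conversion hyperparameters:

```python
from transformers import AutoConfig

# Inspect the attention hyperparameters of the original (pre-conversion) model.
# n_head = 12 in the log matches num_attention_heads, but n_head_kv = 0 does
# not match the nonzero num_key_value_heads expected for Qwen2's GQA.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2-1.5B-Instruct", trust_remote_code=True)
print("num_attention_heads =", cfg.num_attention_heads)
print("num_key_value_heads =", cfg.num_key_value_heads)
```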