intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Error loading model when using qwen gguf model #96

Closed: kunger97 closed 9 months ago

kunger97 commented 10 months ago

It's a Qwen base model downloaded from HF. It can run inference with llama.cpp (latest version), but not on the latest version of neural-speed; run_qwen shows this error:

Loading the bin file with GGUF format...
error loading model: unrecognized tensor type 13
Zhenzhong1 commented 10 months ago

@kunger97 Hi, we have fixed this issue in https://github.com/intel/neural-speed/pull/84. Please install Neural Speed from the source code and try again~~
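(For reference, a from-source install typically looks like the following. This is a sketch; the requirements file name is an assumption, so check the repo's README.)

    # clone and build Neural Speed from source (file names assumed; see the project README)
    git clone https://github.com/intel/neural-speed.git
    cd neural-speed
    pip install -r requirements.txt
    python setup.py install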

kunger97 commented 10 months ago

@Zhenzhong1 Hello, I tested the following with the latest neural-speed build:

  1. Install the latest version of llama.cpp.
  2. Pull the latest Qwen 14B model from the HF repository Qwen/Qwen-14B-Chat.
  3. Use the Python script convert-hf-to-gguf.py in llama.cpp to perform the conversion:
    python convert-hf-to-gguf.py ~/Models/Qwen-14B-Chat/ --outtype f16
  4. Quantize using the quantize program built into llama.cpp (see the consolidated sketch after the log below):
    ./quantize ~/Models/Qwen-14B-Chat/ggml-model-f16.gguf 15 #Q4_K
  5. Perform inference using the run_qwen program from neural-speed:
    ./run_qwen -m ~/Models/Qwen-14B-Chat/ggml-model-Q4_K.gguf -p "你好。"

    The program still reports an error and exits with Segmentation fault (core dumped):

    (neural-speed) u22f390a763ad8fc99b0d55cf8c167d0@idc-beta-batch-pvc-node-17:~$ ./run_qwen -m ~/Models/Qwen-14B-Chat/ggml-model-Q4_K.gguf -p "nihao"
    Welcome to use the qwen on the ITREX! 
    main: seed  = 1707018316
    AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
    model.cpp: loading model from /home/u22f390a763ad8fc99b0d55cf8c167d0/Models/Qwen-14B-Chat/ggml-model-Q4_K.gguf
    Loading the bin file with GGUF format...
    error loading model: unrecognized tensor type 13

model_init_from_file: failed to load model
Segmentation fault (core dumped)
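For context on steps 3 and 4 above: llama.cpp's quantize takes an input GGUF, an optional output path, and a quantization type id, and id 15 there is Q4_K_M. A minimal sketch of the whole conversion pipeline (paths illustrative):

    # HF weights -> f16 GGUF, then k-quantize (type id 15 == Q4_K_M in llama.cpp)
    python convert-hf-to-gguf.py ~/Models/Qwen-14B-Chat/ --outtype f16
    ./quantize ~/Models/Qwen-14B-Chat/ggml-model-f16.gguf ~/Models/Qwen-14B-Chat/ggml-model-Q4_K.gguf 15

A Q4_K_M file mixes k-quant tensor types (Q4_K = 12, Q5_K = 13, Q6_K = 14 in the ggml type enum), which is consistent with an "unrecognized tensor type 13" error from a GGUF loader that predates k-quant support.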

Zhenzhong1 commented 10 months ago

@kunger97 Thanks for your reply!

This error means you are not using the latest Neural Speed branch.

> install the latest version of llama.cpp

Please reinstall Neural Speed from the source code, not llama.cpp~

pip list | grep neural-speed

pip uninstall neural-speed # please make sure you have uninstalled all neural-speed libs. Then run python setup.py install in the Neural Speed root directory and try other models again if you want.
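A concrete sequence (a sketch; repeat the uninstall until pip list prints nothing):

    # remove every installed copy, then rebuild from the repo root
    pip uninstall -y neural-speed
    pip list | grep neural-speed   # should print nothing
    python setup.py install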

But Qwen may not currently be among the models supported in GGUF format. When Neural Speed enabled the GGUF feature, there was no general GGUF Qwen model yet. I will add GGUF-format support for this model in Neural Speed as soon as possible.

The original Neural Speed bin model format for Qwen should be OK.
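(A sketch of producing that native bin format, assuming the repo ships scripts/convert.py with --outtype/--outfile flags; verify against your checkout.)

    # convert the HF Qwen checkpoint to Neural Speed's native bin format
    # (script name and flags are assumptions; check scripts/ in the repo)
    python scripts/convert.py --outtype f32 --outfile qwen-f32.bin ~/Models/Qwen-14B-Chat/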

Thank you again!

dellamuradario commented 10 months ago

Hi @Zhenzhong1, I get the same error:

error loading model: unrecognized tensor type 12
model_init_from_file: failed to load model

OS: WSL2 - Linux DESKTOP-PNBMAG8 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python: Python 3.10.12

I tried:

pip uninstall neural-speed

git pull

pip list | grep neural-speed
neural-speed                  0.2.dev9+ge2d3652

python3 setup.py install

python3 scripts/inference.py --model_name llama -m /home/dario-reply/neural-speed-tutorial/llama-2-7b.Q4_K_M.gguf -c 512 -b 1024 -n 256 -t 10 --color -p "She opened the door and see"

The same error. What should I do? Thank you for your help and support.

Zhenzhong1 commented 10 months ago

@dellamuradario Hi~ your branch may not be the latest; your neural-speed version, 0.2.dev, is old. Please git pull the latest main branch.


I saw you used llama-2-7b.Q4_K_M.gguf. This quantization type is not supported yet. Please try a q4_0.gguf instead.
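(If you have the original f16 GGUF, a Q4_0 file can be produced with llama.cpp's quantize, where type id 2 is Q4_0. A sketch, paths illustrative:)

    # produce a Q4_0 GGUF, which the loader recognizes (type id 2 == Q4_0 in llama.cpp)
    ./quantize ~/models/llama-2-7b/ggml-model-f16.gguf ~/models/llama-2-7b/llama-2-7b.Q4_0.gguf 2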

And please use another script, not inference.py.

try this:

# numactl -m 0 -C 0-55 is optional
# model_path should be the local llama HF model.
numactl -m 0 -C 0-55 python scripts/python_api_example_for_gguf.py --model_name llama --model_path /home/zhenzhong/model/Llama-2-7b-chat-hf/ -m /home/zhenzhong/model/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_0.gguf
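
For reference: numactl -m 0 binds memory allocation to NUMA node 0 and -C 0-55 pins execution to cores 0-55; adjust both to your machine's topology, or drop numactl entirely.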

Inference screenshot: (omitted)

dellamuradario commented 10 months ago

Thank you @Zhenzhong1! It works!

Zhenzhong1 commented 9 months ago

By the way, Qwen GGUF is now supported: https://github.com/intel/neural-speed/pull/127