vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AssertionError when loading Qwen 2.5 GGUF q3 model in vLLM #8697

Open frei-x opened 11 hours ago

frei-x commented 11 hours ago

Your current environment

I'm encountering an AssertionError when trying to load the Qwen 2.5 GGUF (Qwen-2.5-q3_gguf.bin) model using vLLM. The error is raised in vocab_parallel_embedding.py, which asserts that the shape of the loaded weight matches the expected vocabulary size.
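
For context, here is a minimal sketch (not vLLM's actual code) of the kind of shape check that fails. The concrete sizes below are made up for illustration; in practice they come from the model config and from the tensor shapes parsed out of the GGUF file:

import torch

expected_vocab_size = 152064                 # hypothetical: vocab size from the model config
loaded_weight = torch.empty(151936, 5120)    # hypothetical: embedding tensor parsed from the GGUF file

# vocab_parallel_embedding.py performs a check along these lines before
# copying the weight into the embedding table; running this snippet
# raises AssertionError, mirroring the failure seen at load time.
assert loaded_weight.shape[0] == expected_vocab_size, (
    f"vocab size mismatch: got {loaded_weight.shape[0]}, "
    f"expected {expected_vocab_size}"
)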

Model Input Dumps

No response

🐛 Describe the bug

python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2.5-32B-Instruct-GGUF-q3_k_m/qwen2.5-32b-instruct-q3_k_m.gguf --dtype float16 --api-key '' --tensor-parallel-size 1 --trust-remote-code --gpu-memory-utilization 0.8 --port 8000 --max_model_len 10000 --enforce-eager --quantization gguf
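
For reference, once the server comes up, a quick smoke test against the OpenAI-compatible endpoint looks like this (the model name is the same path passed to --model above; the api_key value is a placeholder, since the server was started with an empty key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/data/models/Qwen2.5-32B-Instruct-GGUF-q3_k_m/qwen2.5-32b-instruct-q3_k_m.gguf",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)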

The same GGUF file works fine in Ollama.


Isotr0py commented 9 hours ago

Refer to https://github.com/vllm-project/vllm/issues/7689#issuecomment-2299588012 — you can use the latest version of transformers:

pip install git+https://github.com/huggingface/transformers

frei-x commented 7 hours ago

Refer to #7689 (comment); you can use the latest version of transformers:

pip install git+https://github.com/huggingface/transformers

Success with transformers 4.45.0.dev0
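
To confirm the dev build is the one actually being picked up before restarting the server, you can print the installed version:

python -c "import transformers; print(transformers.__version__)"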