vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AssertionError when loading Qwen 2.5 GGUF q3 model in vLLM #8697

Open frei-x opened 11 hours ago

frei-x commented 11 hours ago

Your current environment

I'm encountering an AssertionError when trying to load the Qwen 2.5 GGUF (Qwen-2.5-q3_gguf.bin) model using vLLM. The error is raised in vocab_parallel_embedding.py, which asserts that the shape of the loaded weight matches the expected vocabulary size.
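
For context, here is a minimal sketch (not vLLM's actual code) of the kind of shape check that fails. The concrete sizes below are made up for illustration; in practice they come from the model config and from the tensor shapes parsed out of the GGUF file:

import torch

expected_vocab_size = 152064                 # hypothetical: vocab size from the model config
loaded_weight = torch.empty(151936, 5120)    # hypothetical: embedding tensor parsed from the GGUF file

# vocab_parallel_embedding.py performs a check along these lines before
# copying the weight into the embedding table; running this snippet
# raises AssertionError, mirroring the failure seen at load time.
assert loaded_weight.shape[0] == expected_vocab_size, (
    f"vocab size mismatch: got {loaded_weight.shape[0]}, "
    f"expected {expected_vocab_size}"
)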

Model Input Dumps

No response

🐛 Describe the bug

python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2.5-32B-Instruct-GGUF-q3_k_m/qwen2.5-32b-instruct-q3_k_m.gguf --dtype float16 --api-key '' --tensor-parallel-size 1 --trust-remote-code --gpu-memory-utilization 0.8 --port 8000 --max_model_len 10000 --enforce-eager --quantization gguf
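
For reference, once the server comes up, a quick smoke test against the OpenAI-compatible endpoint looks like this (the model name is the same path passed to --model above; the api_key value is a placeholder, since the server was started with an empty key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/data/models/Qwen2.5-32B-Instruct-GGUF-q3_k_m/qwen2.5-32b-instruct-q3_k_m.gguf",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)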

The same GGUF file works fine in Ollama.


Isotr0py commented 9 hours ago

Refer to https://github.com/vllm-project/vllm/issues/7689#issuecomment-2299588012 — you can use the latest version of transformers:

pip install git+https://github.com/huggingface/transformers

frei-x commented 7 hours ago

Refer to #7689 (comment); you can use the latest version of transformers:

pip install git+https://github.com/huggingface/transformers

Success with transformers 4.45.0.dev0
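
To confirm the dev build is the one actually being picked up before restarting the server, you can print the installed version:

python -c "import transformers; print(transformers.__version__)"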