So vLLM 0.5.2 works, but 0.5.3.post1 doesn't?
@mgoin
I see, thanks for reporting. This seems to be happening due to Gemma's stricter model loading in vLLM: https://github.com/vllm-project/vllm/blob/d7a299edaa5d23f3d7d5c98b53872a8ced9aad80/vllm/model_executor/models/gemma.py#L405-L409. I will work on a fix.
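For reference, the linked check refuses to finish loading if any model parameter was never assigned a checkpoint weight. Paraphrased (not a verbatim quote of that file):

```python
# Paraphrase of the strict check in GemmaForCausalLM.load_weights
# (vllm/model_executor/models/gemma.py, lines linked above).
def assert_all_weights_loaded(params_dict, loaded_params):
    # params_dict: every parameter the model expects;
    # loaded_params: names actually filled from the checkpoint.
    # Quantization-specific tensors (e.g. FP8 scales) that are loaded
    # through a different path can presumably trip this check.
    unloaded_params = params_dict.keys() - loaded_params
    if unloaded_params:
        raise RuntimeError(
            "Some weights are not initialized from checkpoints: "
            f"{unloaded_params}")
```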
gptq_marlin not found.
@linpan Please open a separate issue, as this is unrelated. It seems you already have a GPTQ model, so you should not specify any --quantization flag; vLLM will automatically convert it to gptq_marlin if it is able to.
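For example (a sketch; the model path is a placeholder, not your actual checkpoint):

```python
from vllm import LLM

# Leave `quantization` unset: vLLM reads the quantization_config from the
# checkpoint's config.json and upgrades GPTQ to gptq_marlin when the GPU
# and kernels support it.
llm = LLM(model="your-org/your-gptq-model")  # placeholder path
```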
@mgoin I could load the gemma-2-27b FP8 quants successfully on the latest main branch. The response seems corrupted, though, but that may be a different issue (the GPTQ quants of gemma-2-27b were corrupted too).
Thank you for the quick fix!
Your current environment
🐛 Describe the bug
I could not launch the api_server with gemma-2-27b-it-FP8D on 0.5.3.post1.
I could launch with:
The log is as below:
I quantized to FP8 dynamic using the code below.
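A minimal sketch of FP8-dynamic quantization with AutoFP8 (the tool choice and arguments here are an assumption; the exact script is not shown above):

```python
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Dynamic activation scaling needs no calibration data,
# so quantize() can be called with an empty example list.
quantize_config = BaseQuantizeConfig(quant_method="fp8",
                                     activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained("google/gemma-2-27b-it",
                                           quantize_config=quantize_config)
model.quantize([])
model.save_quantized("gemma-2-27b-it-FP8D")
```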
You may also reproduce using nm-testing/gemma-2-27b-it-FP8 on Hugging Face.
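A minimal reproduction sketch, assuming the offline LLM entrypoint exercises the same weight-loading path as the api_server:

```python
from vllm import LLM

# Per this report: fails while loading weights on vLLM 0.5.3.post1,
# loads successfully on the patched main branch.
llm = LLM(model="nm-testing/gemma-2-27b-it-FP8")

# A short completion also surfaces the (possibly separate)
# corrupted-output issue mentioned above.
print(llm.generate("Hello")[0].outputs[0].text)
```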