Open maxin9966 opened 1 month ago
🚀 The feature, motivation and pitch

How to run the int4-quantized version of the gemma-2-27b model?
Can you share the failure? It should work AFAIK
@mgoin Running gemma-2-27b-gptq through vLLM produces every output as pad pad pad. I have tested different versions of FlashInfer and vLLM, and the results are the same. Could you tell me how to run a 4-bit quantized version of gemma-2-27b-it?
vLLM launch command:

VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype float16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager

Output (truncated): [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='
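For reference, the output above comes from the OpenAI-compatible endpoint. A minimal sketch of such a client, assuming the server launched above is reachable locally on port 1231 with served model name gpt (the prompt is a placeholder):

```python
# Minimal client sketch against the vLLM OpenAI-compatible server above.
# Assumes base_url http://localhost:1231/v1 and served-model-name "gpt"
# from the launch command; vLLM ignores the api_key value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1231/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="gpt",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=64,
)
# With the broken checkpoint, this prints a run of <pad> tokens.
print(resp.choices[0].message.content)
```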
Do you have a reference to that model working in Transformers or some other inference engine? Maybe this is worth reporting as an issue on that specific model card.
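One quick sanity check is loading the checkpoint directly in Transformers. A minimal sketch, assuming optimum plus a GPTQ kernel backend (auto-gptq or gptqmodel) is installed so Transformers can pick up the quantization config stored in the checkpoint:

```python
# Sanity check: run the GPTQ checkpoint through Transformers directly.
# Assumes `pip install transformers optimum auto-gptq` (or gptqmodel);
# Transformers reads quantization_config from the checkpoint itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelCloud/gemma-2-27b-it-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
# If this also emits only <pad> tokens, the checkpoint itself is suspect,
# not vLLM.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```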
@mgoin I can't find any other int4-quantized versions on HF; everything else there is in GGUF format. Where can I find an int4 version?
I don't know; it is up to the community to produce an int4 checkpoint. I am just saying we don't know whether the output you reported is normal for that checkpoint or specific to vLLM. It says it was made by GPTQModel: https://github.com/ModelCloud/GPTQModel
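For anyone who wants to produce their own int4 checkpoint, the GPTQModel README outlines a flow along the lines of the sketch below; the exact API may differ between versions, and the calibration set here is a tiny placeholder (real quantization needs representative text):

```python
# Rough sketch of producing an int4 GPTQ checkpoint with GPTQModel,
# following the pattern in the GPTQModel README; API names may vary by version.
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder calibration data -- use a representative corpus in practice.
calibration_dataset = [
    "vLLM is a fast and easy-to-use library for LLM inference.",
    "Quantization reduces the memory footprint of large models.",
]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("google/gemma-2-27b-it", quant_config)
model.quantize(calibration_dataset)
model.save("gemma-2-27b-it-gptq-4bit")
```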