Which model did you test? I've been running an SGLang Marlin branch since the kernels were merged into vLLM.
Can you try one of my Marlin models, for example: https://huggingface.co/qeternity/Nous-Hermes-2-Mistral-7B-DPO-GPTQ-4bit-128g-actorder_False-Marlin
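Something like this should be enough to smoke-test it (a rough sketch using SGLang's local `Runtime` API; adjust to however you normally launch it):

```python
import sglang as sgl

# Load the Marlin-serialized GPTQ checkpoint with a local SGLang runtime.
runtime = sgl.Runtime(
    model_path="qeternity/Nous-Hermes-2-Mistral-7B-DPO-GPTQ-4bit-128g-actorder_False-Marlin"
)
sgl.set_default_backend(runtime)

@sgl.function
def smoke_test(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = smoke_test.run(question="What is the capital of France?")
print(state["answer"])
runtime.shutdown()
```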
@qeternity Confirmed: this is a compat bug when the Marlin quantization is produced with AutoGPTQ. @liurl21 will push an AutoGPTQ Marlin quant PR within the hour to fix this. There appear to be two different methods of using `quant_method`. Not sure which is the "standard", but we will push this PR soon.
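For reference, the two layouts look roughly like this. This is only a sketch: `quant_method` is the field under discussion, and the surrounding keys (including the Marlin-format flag) are illustrative, not authoritative:

```python
# Variant A: the checkpoint declares plain GPTQ and relies on an extra
# format flag to signal that the weights are already repacked for Marlin.
variant_a = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": False,
    "is_marlin_format": True,  # assumed flag name, for illustration only
}

# Variant B: the checkpoint declares Marlin directly as its quant_method.
variant_b = {
    "quant_method": "marlin",
    "bits": 4,
    "group_size": 128,
}
```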
PR #290 has been created, which fixes this compat issue.
AutoGPTQ's direct Marlin quantization support is actually broken. Use my pending PR https://github.com/AutoGPTQ/AutoGPTQ/pull/586 to quantize with it.
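Roughly, the flow with that branch is the standard AutoGPTQ quantize path; a minimal sketch is below. The Marlin-format switch shown here is an assumed name for illustration, so check the PR for the exact option:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"

# The Marlin kernel needs 4-bit weights, group_size 128 (or -1),
# and act-order disabled (desc_act=False).
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    is_marlin_format=True,  # assumed option name from the PR, for illustration
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A small list of tokenized calibration examples is enough for a smoke test.
examples = [tokenizer("SGLang is a fast serving framework for LLMs.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("Nous-Hermes-2-Mistral-7B-DPO-GPTQ-4bit-128g-Marlin")
```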
@qeternity In PR #286 the Marlin kernel is merged, but when is it actually used?
I have tested a Marlin Llama-2 model (it works on vLLM), but it does not work on the latest SGLang tip.