casper-hansen opened this issue 4 months ago
PR welcomed! (or is there existing ones with ExLlamaV2?)
This is not about ExLlamaV2 - my PR was just showcasing 64% faster decoding at batch size 32.
I am first looking to distribute models on HF before making any PR myself. This is essentially AWQ kernels version 2.0.
It would be great to be able to load these new AWQ models in vLLM. I tried a quantized version of LLaVA 1.5 with the demo in https://github.com/mit-han-lab/llm-awq and the improvement is substantial.
@casper-hansen are there any pointers on how to load these new quantized models after converting the checkpoint to HF models? Perhaps others can contribute as well.
According to my testing, it's possible to get even faster decoding than with the ExLlamaV2 kernels. The prefilling speed is roughly the same as with the current GEMM kernels (including the dequantize + torch.matmul trick).
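For readers unfamiliar with the "dequantize + matmul" trick mentioned above: at large batch sizes it can be faster to dequantize the packed weights back to floating point once and run a plain dense matmul, rather than calling the quantized GEMM kernel. A minimal NumPy sketch of the idea (shapes, group size, and the unpacked int layout are simplified for illustration; real AWQ kernels pack eight 4-bit values per int32 and run on GPU):

```python
import numpy as np

# Simulated 4-bit quantized weight matrix: integers in [0, 15] with a
# per-group scale and zero point, as in AWQ-style group quantization.
rng = np.random.default_rng(0)
in_features, out_features, group_size = 8, 8, 4

w_q = rng.integers(0, 16, size=(in_features, out_features)).astype(np.int32)
scales = rng.random((in_features // group_size, out_features)).astype(np.float32)
zeros = rng.integers(0, 16, size=(in_features // group_size, out_features)).astype(np.int32)

def dequantize(w_q, scales, zeros, group_size):
    # Broadcast each group's scale/zero over its rows, then apply
    # w = (q - zero) * scale.
    s = np.repeat(scales, group_size, axis=0)
    z = np.repeat(zeros, group_size, axis=0)
    return (w_q - z).astype(np.float32) * s

x = rng.random((32, in_features)).astype(np.float32)  # batch of 32 activations
w = dequantize(w_q, scales, zeros, group_size)        # one-time dequantization
y = x @ w                                             # plain dense matmul
print(y.shape)
```

The dequantization cost is amortized over the whole batch, which is why the trick pays off for prefill (many tokens per weight load) but not for single-token decoding, where the quantized kernel's reduced memory traffic wins.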
Reference: https://github.com/casper-hansen/AutoAWQ/pull/365