vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

AWQ: Implement new kernels (64% faster decoding) #3025

Open casper-hansen opened 4 months ago

casper-hansen commented 4 months ago

According to my testing, it's possible to get even faster decoding than with the ExLlamaV2 kernels. Prefill speed is roughly the same as the current GEMM kernels (including the dequantize + torch.matmul trick).

Reference: https://github.com/casper-hansen/AutoAWQ/pull/365
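
For context, here is a rough PyTorch sketch of the dequantize + torch.matmul prefill path. The helper is illustrative only: the packing order is simplified, whereas the real AWQ kernels use an interleaved 4-bit layout inside fused CUDA code.

```python
import torch

def dequantize_awq(qweight, scales, qzeros, group_size=128, bits=4):
    """Illustrative unpack of AWQ-style 4-bit weights packed 8-per-int32.

    Shapes assumed: qweight [K, N//8], scales [K//group_size, N],
    qzeros [K//group_size, N//8]. Real AWQ kernels use an interleaved
    packing order; this sketch uses a plain sequential order.
    """
    shifts = torch.arange(0, 32, bits, device=qweight.device)      # [8]
    iweight = (qweight.unsqueeze(-1) >> shifts) & 0xF              # [K, N//8, 8]
    iweight = iweight.reshape(qweight.shape[0], -1)                # [K, N]
    izeros = (qzeros.unsqueeze(-1) >> shifts) & 0xF
    izeros = izeros.reshape(qzeros.shape[0], -1)                   # [K//g, N]
    # Broadcast per-group scales and zero points over each group of K rows.
    scales = scales.repeat_interleave(group_size, dim=0)           # [K, N]
    izeros = izeros.repeat_interleave(group_size, dim=0)           # [K, N]
    return (iweight.float() - izeros.float()) * scales             # fp weights [K, N]

def awq_prefill_matmul(x, qweight, scales, qzeros):
    # For long prompts it can be faster to dequantize once and run a regular
    # dense GEMM than to call the quantized GEMM kernel token by token.
    w = dequantize_awq(qweight, scales, qzeros).to(x.dtype)
    return torch.matmul(x, w)
```

The point of the trick is that during prefill the one-time dequantization cost is amortized over a large dense matmul, which is why it roughly matches the current GEMM kernels.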

simon-mo commented 4 months ago

PR welcomed! (or are there existing ones with ExLlamaV2?)

casper-hansen commented 4 months ago

> PR welcomed! (or are there existing ones with ExLlamaV2?)

This is not about ExLlamaV2 - my PR was just showcasing 64% faster decoding at batch size 32.

I am first looking to distribute models on HF before making any PR myself. This is essentially AWQ kernels version 2.0.

isaac-vidas commented 4 months ago

It would be great to be able to load these new AWQ models in vLLM. I tried a quantized version of LLaVA 1.5 with the demo in https://github.com/mit-han-lab/llm-awq and the improvement is substantial.

@casper-hansen are there any pointers on how to load these new quantized models after converting the checkpoint to HF format? Perhaps others can contribute as well.
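
For the AWQ checkpoints vLLM already supports, loading looks like the sketch below (the model name is just a placeholder; the new kernel format discussed in this issue would still need its own kernel and loader support in vLLM):

```python
from vllm import LLM, SamplingParams

# Load an existing AWQ-quantized checkpoint with vLLM's current AWQ support.
# Replace the placeholder model name with the quantized checkpoint you want.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is AWQ quantization?"], params)
print(outputs[0].outputs[0].text)
```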