mobicham opened this issue 1 month ago
Hi @mobicham, we will consider supporting it in our upcoming release. Is the flute implementation not optimal (for example, does it achieve approximately a 5x speedup over the fp16 gemv)?
In my benchmarks on the 3090, it's not that fast end-to-end. Llama3 8B decoding speed at 4-bit is about 67 tokens/sec with flute vs. 97 tokens/sec with torchao/bitblas (group-size=64, batch-size=1).
The quality tends to be better with LUT than with linear quantization though, as expected, since linear quantization is just a special case of LUT. Linear quantization runs faster because there is no cost to read the LUT from shared memory.
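To make the "special case" point concrete, here is a minimal sketch (not code from flute or BitBLAS; the `scale`, `zero_point`, and shapes are made-up illustrative values): the reconstruction grid of a linear (affine) 4-bit quantizer is just a uniform 16-entry lookup table, whereas a general LUT can place its entries anywhere but must be gathered at dequant time.

```python
import torch

bits = 4
scale, zero_point = 0.05, 8  # example per-group parameters (illustrative)

# Linear dequantization: w_hat = (q - zero_point) * scale
q = torch.arange(2**bits)                      # all possible 4-bit codes
linear_lut = (q - zero_point) * scale          # a uniform 16-entry LUT

# A general LUT quantizer can place its 16 entries anywhere (e.g. denser
# near zero), which is why LUT tends to recover quality better, but the
# kernel has to fetch entries from shared memory at dequant time.
codes = torch.randint(0, 2**bits, (8,))
dequant_linear = (codes - zero_point) * scale  # computed arithmetically
dequant_lut = linear_lut[codes]                # gathered from the table
assert torch.allclose(dequant_linear.float(), dequant_lut.float())
```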
@mobicham, got it! Thanks for sharing.
@LeiWang1999 Are there any benchmark speed tests for w4a8 compared to fp16?
@brisker, we provide benchmark scripts for the bitblas matmul:
https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py
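For reference, here is a hedged sketch of the kind of measurement that script performs, based on my reading of the BitBLAS README; the shapes, dtypes, and the `profile_latency` usage are assumptions and may differ from the exact script and version:

```python
import bitblas

# Build a w4a16 matmul operator and profile its kernel latency.
# Shapes below are illustrative, not the benchmark's exact settings.
config = bitblas.MatmulConfig(
    M=1,                    # batch-1 GEMV, typical for decoding
    N=4096,
    K=4096,
    A_dtype="float16",      # fp16 activations
    W_dtype="int4",         # 4-bit weights
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_scaling=True,
    group_size=-1,          # -1: per-channel scales (no group-wise groups)
)
matmul = bitblas.Matmul(config=config)
latency_ms = matmul.profile_latency()  # kernel latency in milliseconds
print(f"w4a16 latency: {latency_ms:.4f} ms")
```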
@LeiWang1999
In the link you provided, I noticed that you compared bitblas-w4a16 with marlin-w4a16. Are they both tested with per-channel w4 quantization (i.e., without any group-wise weight quantization tricks)?
Also, is the w4a8 quantization pipeline integrated into vLLM yet?
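For anyone else following along, here is a hypothetical illustration (not BitBLAS code; `quantize_int4` and `group_size` are made-up names) of the per-channel vs. group-wise distinction being asked about: per-channel keeps one scale per output channel, while group-wise splits each row into groups of `group_size` inputs, each with its own scale.

```python
import torch

def quantize_int4(W, group_size=None):
    # Symmetric int4 quantization to the range [-7, 7].
    out_features, in_features = W.shape
    if group_size is None:
        group_size = in_features            # per-channel == one group per row
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    scales = Wg.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(Wg / scales), -7, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

W = torch.randn(4096, 4096)
q_pc, s_pc = quantize_int4(W)                  # per-channel: scales [4096, 1]
q_gw, s_gw = quantize_int4(W, group_size=128)  # group-wise:  scales [4096, 32]
```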
Great work! Any chance you could add support for 3-bit? I know the bitpacking is a bit tricky with 3-bit, but it would be great to have a 3-bit kernel for linear quantization, since the only one available is the LUT one via flute, and 2-bit quantization quality for smaller pre-trained models is sub-optimal for production. Thanks!
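To illustrate why the 3-bit packing is awkward (a hedged sketch, not a proposed BitBLAS layout): 3 bits do not divide evenly into a byte or a 32-bit word, so one common workaround is to pack 8 three-bit codes (24 bits) into 3 bytes.

```python
import numpy as np

def pack_3bit(vals):
    # vals: uint8 array of 3-bit codes (0..7), length divisible by 8.
    assert vals.size % 8 == 0 and vals.max() < 8
    bits = np.unpackbits(vals.astype(np.uint8).reshape(-1, 1), axis=1)[:, -3:]
    return np.packbits(bits.reshape(-1, 24))   # 8 codes -> 24 bits -> 3 bytes

def unpack_3bit(packed):
    bits = np.unpackbits(packed).reshape(-1, 3)
    # Recombine [bit2, bit1, bit0] back into a 3-bit integer.
    return (bits * np.array([4, 2, 1], dtype=np.uint8)).sum(axis=1).astype(np.uint8)

vals = np.random.randint(0, 8, size=64).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(vals)), vals)
```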