mobicham opened this issue 1 month ago
Hi @mobicham, we will consider supporting it in our upcoming release. Is the flute implementation not optimal (for example, does it achieve approximately a 5x speedup over the fp16 gemv)?
In my benchmarks on the 3090, it's not that fast end-to-end. Llama3 8B decoding speed at 4-bit is about 67 tokens/sec with flute vs. 97 tokens/sec with torchao/bitblas (group-size=64, batch-size=1).
The quality tends to be better with LUT than with linear quantization though, as expected, since linear quantization is just a special case of LUT. Linear quantization runs faster because there is no cost to read the LUT from shared memory.
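To make the "special case" point concrete, here is a minimal sketch (not code from flute or BitBLAS; the `scale`, `zero_point`, and shapes are made-up illustrative values): the reconstruction grid of a linear (affine) 4-bit quantizer is just a uniform 16-entry lookup table, whereas a general LUT can place its entries anywhere but must be gathered at dequant time.

```python
import torch

bits = 4
scale, zero_point = 0.05, 8  # example per-group parameters (illustrative)

# Linear dequantization: w_hat = (q - zero_point) * scale
q = torch.arange(2**bits)                      # all possible 4-bit codes
linear_lut = (q - zero_point) * scale          # a uniform 16-entry LUT

# A general LUT quantizer can place its 16 entries anywhere (e.g. denser
# near zero), which is why LUT tends to recover quality better, but the
# kernel has to fetch entries from shared memory at dequant time.
codes = torch.randint(0, 2**bits, (8,))
dequant_linear = (codes - zero_point) * scale  # computed arithmetically
dequant_lut = linear_lut[codes]                # gathered from the table
assert torch.allclose(dequant_linear.float(), dequant_lut.float())
```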
@mobicham, got it! Thanks for sharing.
@LeiWang1999 Are there any benchmark speed tests for w4a8 compared to fp16?
@brisker, we provide benchmark scripts for the bitblas matmul:
https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py
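For reference, here is a hedged sketch of the kind of measurement that script performs, based on my reading of the BitBLAS README; the shapes, dtypes, and the `profile_latency` usage are assumptions and may differ from the exact script and version:

```python
import bitblas

# Build a w4a16 matmul operator and profile its kernel latency.
# Shapes below are illustrative, not the benchmark's exact settings.
config = bitblas.MatmulConfig(
    M=1,                    # batch-1 GEMV, typical for decoding
    N=4096,
    K=4096,
    A_dtype="float16",      # fp16 activations
    W_dtype="int4",         # 4-bit weights
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_scaling=True,
    group_size=-1,          # -1: per-channel scales (no group-wise groups)
)
matmul = bitblas.Matmul(config=config)
latency_ms = matmul.profile_latency()  # kernel latency in milliseconds
print(f"w4a16 latency: {latency_ms:.4f} ms")
```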
@LeiWang1999
In the link you provided, I noticed that you compared bitblas-w4a16 with marlin-w4a16. Are they both tested with per-channel w4 quantization (i.e., without any group-wise weight quantization tricks)?
Also, is the w4a8 quantization pipeline integrated into vLLM yet?
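For anyone else following along, here is a hypothetical illustration (not BitBLAS code; `quantize_int4` and `group_size` are made-up names) of the per-channel vs. group-wise distinction being asked about: per-channel keeps one scale per output channel, while group-wise splits each row into groups of `group_size` inputs, each with its own scale.

```python
import torch

def quantize_int4(W, group_size=None):
    # Symmetric int4 quantization to the range [-7, 7].
    out_features, in_features = W.shape
    if group_size is None:
        group_size = in_features            # per-channel == one group per row
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    scales = Wg.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(Wg / scales), -7, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

W = torch.randn(4096, 4096)
q_pc, s_pc = quantize_int4(W)                  # per-channel: scales [4096, 1]
q_gw, s_gw = quantize_int4(W, group_size=128)  # group-wise:  scales [4096, 32]
```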
Great work! Any chance you could add support for 3-bit? I know the bitpacking is a bit tricky with 3-bit, but it would be great to have a 3-bit kernel for linear quantization, since the only one available is the LUT one via flute, and 2-bit quantization quality for smaller pre-trained models is sub-optimal for production. Thanks!
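To illustrate why the 3-bit packing is awkward (a hedged sketch, not a proposed BitBLAS layout): 3 bits do not divide evenly into a byte or a 32-bit word, so one common workaround is to pack 8 three-bit codes (24 bits) into 3 bytes.

```python
import numpy as np

def pack_3bit(vals):
    # vals: uint8 array of 3-bit codes (0..7), length divisible by 8.
    assert vals.size % 8 == 0 and vals.max() < 8
    bits = np.unpackbits(vals.astype(np.uint8).reshape(-1, 1), axis=1)[:, -3:]
    return np.packbits(bits.reshape(-1, 24))   # 8 codes -> 24 bits -> 3 bytes

def unpack_3bit(packed):
    bits = np.unpackbits(packed).reshape(-1, 3)
    # Recombine [bit2, bit1, bit0] back into a 3-bit integer.
    return (bits * np.array([4, 2, 1], dtype=np.uint8)).sum(axis=1).astype(np.uint8)

vals = np.random.randint(0, 8, size=64).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(vals)), vals)
```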