jy-yuan / KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

W/ or w/o Weight quantization? #6

Closed: deephanson94 closed this issue 2 weeks ago

deephanson94 commented 1 month ago

Nice work! I wonder whether the experiments and results in your KIVI paper use quantized weights, or FP16 weights?

henryzhongsc commented 1 month ago

Thanks for the nice words!

The results reported in our paper are based on FP16 weights, as we wanted to isolate the effect of KV cache quantization. However, we understand quantization techniques are often applied jointly, and we (or specifically, @jy-yuan for proper credit) did conduct additional experiments on applying KIVI to weight-quantized models.

In short, it holds up decently. From a quick look at jy's results on Llama-2-7b, W8-KIVI2 is roughly on par with W16-KIVI2 on most of the LongBench tasks, but shows a roughly 1% drop relative to W16-KIVI2 on the more challenging tasks (recall that W16-KIVI2 itself already shows a roughly 1% drop relative to W16-KV16 on those tasks).

If @jy-yuan has time, maybe he can post some actual numbers here.
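
In the meantime, for anyone who wants intuition on what the KV cache side does independently of the weights, here is a rough PyTorch sketch of asymmetric 2-bit group quantization in the spirit of the paper (keys grouped per-channel, values per-token). This is purely illustrative: it is not the implementation in this repo, it omits the fused kernels and the full-precision residual of recent tokens, and the helper names are made up for this example.

```python
import torch

def asym_quant(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric n-bit quantization over contiguous groups along the last dim."""
    orig_shape = x.shape
    x = x.reshape(-1, group_size)                       # [num_groups, group_size]
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = ((x - x_min) / scale).round().clamp(0, 2 ** n_bits - 1)
    return q.to(torch.uint8), scale, x_min, orig_shape

def asym_dequant(q, scale, x_min, orig_shape):
    return (q.float() * scale + x_min).reshape(orig_shape)

# Toy single-head cache: [num_tokens, head_dim]
K = torch.randn(128, 64)
V = torch.randn(128, 64)

# Keys: group along the token axis within each channel (per-channel), since
# the key cache has outlier channels. Values: group along the channel axis
# within each token (per-token).
K_q = asym_quant(K.t().contiguous(), n_bits=2, group_size=32)
V_q = asym_quant(V, n_bits=2, group_size=32)

K_hat = asym_dequant(*K_q).t()
V_hat = asym_dequant(*V_q)
print("K reconstruction error:", (K - K_hat).abs().mean().item())
print("V reconstruction error:", (V - V_hat).abs().mean().item())
```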

deephanson94 commented 1 month ago

Thanks for the reply!

If you guys have W4-KIVI2 results to share, that would be amazing too. Most LLMs nowadays are quantized to 4-bit weights.

jy-yuan commented 1 month ago

Thanks for raising this; as @henryzhongsc mentioned, our KIVI method can be seamlessly combined with weight quantization. Here are some actual numbers:

On CoQA, the 8-bit-weight Llama-2-7B reaches an accuracy of 63.6, and when combined with KIVI 2-bit KV cache quantization, the accuracy is maintained at 63.4. On the LongBench tasks, the 8-bit model has a mean score of 44.96, and when combined with the KIVI 2-bit KV cache it still achieves 44.72.
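
For intuition on why the combination is seamless: weight quantization is applied to the model parameters once, offline, whereas KIVI quantizes K/V on the fly as they are appended to the cache, so the two steps operate on disjoint tensors. Below is a toy, simulated (quantize-dequantize) INT8 weight sketch just for illustration; it is not the actual W8 setup used to produce the numbers above, and the helper names are made up for this example.

```python
import torch

@torch.no_grad()
def fake_quant_weight_int8(w: torch.Tensor) -> torch.Tensor:
    """Simulated symmetric per-output-channel INT8 quantize-dequantize."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    return (w / scale).round().clamp(-128, 127) * scale

@torch.no_grad()
def quantize_linear_weights(model: torch.nn.Module) -> None:
    # Quantize-dequantize every nn.Linear weight in place. The KV cache
    # quantizer (see the earlier sketch) later acts on the activations these
    # layers produce, so the two steps never touch the same tensors.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.copy_(fake_quant_weight_int8(module.weight))

# Toy usage: an 8-bit "key projection" whose output would then be handed to
# the 2-bit cache quantizer.
k_proj = torch.nn.Linear(64, 64, bias=False)
quantize_linear_weights(k_proj)
K = k_proj(torch.randn(128, 64))  # this K is what the cache quantizer sees
```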

We plan to add more results to our updated manuscript later on.

Thanks!

deephanson94 commented 1 month ago

Cool, seems like with 8-bit weight quantization + KIVI 2-bit KV cache, the results are still on par with the higher-precision KV cache.

Looking forward to seeing 4-bit weight quantization results with the KIVI KV cache! Awesome work.