jy-yuan / KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

Can this be used with any autoregressive model? #1

Closed hello-fri-end closed 1 month ago

hello-fri-end commented 2 months ago

Hey @jy-yuan, thank you for the awesome paper and code. Is the method/code only applicable to LLaMA, or can it be used with any autoregressive model? If it's the latter, are there instructions on how to quantize the KV cache of an arbitrary transformer model?

zirui-ray-liu commented 2 months ago

Thank you for your interest in our work. Yes, it can be applied to any autoregressive model!

You can think of KIVI as another implementation of the KV cache. You can copy-paste the code here and modify it for the attention module of other models, such as MistralAttention.
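For a rough picture of what such a cache replacement does, here is a minimal, non-optimized sketch of group-wise asymmetric fake quantization in PyTorch. The function names and the `group_size` value are illustrative, not the repository's API; a real 2-bit implementation would pack the codes instead of storing one per byte, and in the paper the key cache is quantized per-channel while the value cache is quantized per-token.

```python
# Sketch only: group-wise asymmetric quantization, the basic operation
# applied to the cached keys/values. Names and defaults are hypothetical.
import torch

def quantize_asym(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric quantization along the last dimension, in groups."""
    orig_shape = x.shape
    x = x.reshape(-1, group_size)                      # (num_groups, group_size)
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero_point = (-x_min / scale).round()
    q = torch.clamp((x / scale).round() + zero_point, 0, 2 ** n_bits - 1)
    return q.to(torch.uint8).reshape(orig_shape), scale, zero_point

def dequantize_asym(q: torch.Tensor, scale, zero_point, group_size: int = 32):
    """Reconstruct an approximation of the original tensor."""
    orig_shape = q.shape
    q = q.reshape(-1, group_size).float()
    return ((q - zero_point) * scale).reshape(orig_shape)

# Example: cached keys of shape (batch, kv_heads, seq_len, head_dim).
k = torch.randn(1, 8, 128, 64)
codes, scale, zp = quantize_asym(k)
k_hat = dequantize_asym(codes, scale, zp)
print((k - k_hat).abs().mean())  # average quantization error
```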

We would also like to note two important things we found when extending KIVI to other models.

  1. Transformers Package Version: Please double-check the transformers package version. We tested our implementation with 4.35.2; in versions >= 4.36, the KV cache data structure has changed.

  2. Attention Implementation Variants: Please double-check which attention mechanism the model uses (multi-head / multi-query / grouped-query). Currently we only release the CUDA and Triton code supporting multi-head attention. Multi-query / grouped-query attention requires small changes to the low-level implementation (see the sketch below); we will release this part soon.
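To illustrate point 2: in the Hugging Face Llama/Mistral modeling code, grouped-query attention keeps fewer KV heads in the cache and broadcasts them to the query heads with a `repeat_kv` helper, so a quantized cache (and its kernels) has to handle the smaller-head layout. The sketch below paraphrases that helper; it is not code from this repository.

```python
# Sketch of the shape issue behind point 2. In grouped-query / multi-query
# attention the cache holds fewer KV heads than query heads, so any custom
# (de)quantization kernel must operate on the smaller tensor; the heads are
# only broadcast afterwards.
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, num_kv_heads, seq_len, head_dim) to
    (batch, num_kv_heads * n_rep, seq_len, head_dim)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

# With multi-head attention n_rep == 1 and the cache already matches the
# query heads; with GQA (e.g. 32 query heads, 8 KV heads) n_rep == 4, and
# the quantized cache should stay in the 8-head layout to save memory.
```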

Stay tuned for further developments!