Thank you for your interest in our work. Yes, it can be applied to any autoregressive model!
You can think of KiVi as another implementation of the KV cache. You can copy-paste the code here and adapt it for the attention module of other models, such as MistralAttention.
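To make the adaptation concrete, here is a minimal sketch (not the released kernels) of asymmetric uniform quantization along one axis of a cache tensor, the basic operation you would wire into an attention module's KV path. The function names and the `bits`/`dim` parameters are illustrative assumptions, not part of the KiVi API:

```python
import torch

def quantize_along(x: torch.Tensor, dim: int, bits: int = 2):
    # Simplified sketch: asymmetric uniform quantization along `dim`.
    # (The paper quantizes the key cache per-channel and the value cache
    # per-token; the real repo uses fused CUDA/Triton kernels instead.)
    qmax = 2 ** bits - 1
    mn = x.amin(dim=dim, keepdim=True)
    mx = x.amax(dim=dim, keepdim=True)
    scale = (mx - mn).clamp(min=1e-8) / qmax
    q = ((x - mn) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, mn

def dequantize(q: torch.Tensor, scale: torch.Tensor, mn: torch.Tensor):
    # Reconstruct an approximation of the original tensor.
    return q.to(scale.dtype) * scale + mn
```

In an attention forward pass you would quantize the new key/value states before appending them to the cache and dequantize (or compute directly on the packed values) when attending over past tokens.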
Here we would like to note two important things we found when extending KiVi to other models.
Transformers Package Version: Please double-check the transformers package version. We tested our implementation with 4.35.2; in versions >= 4.36, the KV cache data structure changed.
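A small guard like the following can catch the version mismatch early; the helper names are hypothetical, and the check just encodes the "tested on 4.35.2, layout changed in 4.36" note above:

```python
def parse_version(v: str) -> tuple:
    # Naive parse, good enough for "major.minor.patch" release tags.
    return tuple(int(p) for p in v.split(".")[:3])

def kv_layout_supported(installed: str) -> bool:
    # True if the installed transformers version still uses the pre-4.36
    # KV cache layout (past_key_values as a tuple of per-layer tensors)
    # that a KiVi-style attention patch would expect.
    return parse_version(installed) < (4, 36, 0)
```

You could call `kv_layout_supported(transformers.__version__)` at import time and raise a clear error instead of failing deep inside the attention forward.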
Attention Implementation Variants: Please double-check which attention mechanism the model uses (Multi-Head, Multi-Query, or Grouped-Query). Currently we have only released the CUDA and Triton code supporting Multi-Head Attention. Multi-Query/Grouped-Query requires some small changes to the low-level implementation. We will release this part soon.
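The reason the variants matter is the cache shape: in Multi-Query/Grouped-Query models the cache holds fewer KV heads than query heads, and each KV head is shared by a group of query heads, so a kernel written for Multi-Head shapes won't line up. A sketch of the head expansion (mirroring the `repeat_kv` helper in Hugging Face's Llama/Mistral code, reproduced here as an assumption about that codebase):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand grouped KV heads to match the number of query heads:
    # (batch, n_kv_heads, seq, head_dim) ->
    # (batch, n_kv_heads * n_rep, seq, head_dim).
    if n_rep == 1:  # Multi-Head case: nothing to expand
        return kv
    b, h, s, d = kv.shape
    return kv[:, :, None, :, :].expand(b, h, n_rep, s, d).reshape(b, h * n_rep, s, d)
```

A quantized-cache kernel either has to store the compact `n_kv_heads` layout and do this expansion implicitly, or quantize after expansion at a memory cost, which is why the low-level code needs adjusting.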
Stay tuned for further developments!
Hey @jy-yuan, thank you for the awesome paper and code. Is the method/code only applicable to Llama, or can it be used with any autoregressive model? If it's the latter, are there instructions on how to quantize the KV cache of an arbitrary transformer model?