ThisisBillhe opened 1 day ago
Hey @ThisisBillhe !
Yes, the current implementation is not really the same as KIVI. As you noted, we don't have a sliding window and we don't use the same quantization as the KIVI folks. KIVI has support for concatenating quantized tensors and for matmuls, while what we do is simply quantize using any open library as a backend, and then dequantize back before concatenating with the full-precision key/values.
The quantized cache in transformers is rather inspired by the idea from the KIVI paper, so it's not a full copy. At the same time, it gives users the flexibility to swap out the `_quantize` and `_dequantize` methods to try out different quantization frameworks/methods/params. Improving the cache in terms of latency was an option, but unfortunately I am short on bandwidth. Feel free to open a PR to make the cache store key/values in a sliding-window fashion, if you want to give it a try 😄
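To make the difference concrete, here is a toy sketch of the quantize → store → dequantize-before-concat flow described above. This is not the transformers implementation; `_quantize`/`_dequantize` here are hypothetical stand-ins operating on plain Python lists.

```python
def _quantize(values, n_bits=4):
    """Toy affine quantization of a list of floats to n_bits unsigned ints."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** n_bits - 1) or 1.0  # avoid division issues on constant input
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def _dequantize(q, scale, lo):
    """Map the quantized ints back to approximate floats."""
    return [x * scale + lo for x in q]

# Prefill: quantize the whole key cache for a layer.
keys = [0.1, -0.5, 2.0, 0.7]
stored = _quantize(keys)

# Decode step: dequantize back to full precision, then concatenate with
# the new full-precision key (no quantized matmul, unlike KIVI).
new_key = [1.5]
full = _dequantize(*stored) + new_key
```

The point of the sketch is that every attention step pays a dequantization cost, which is where a KIVI-style quantized matmul would save latency.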
System Info
transformers 4.45.2
Who can help?
No response
Information

Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
As defined in L693-L719 of `transformers/cache_utils.py`, `QuantizedCache` quantizes all tokens in the prefill phase (L693-L698) without keeping recent tokens in full precision (`residual_length`). Also, during the decoding phase, when `self.key_cache[layer_idx].shape[-2] + 1 >= self.residual_length`, all tokens are quantized without exception. I think KIVI uses a sliding window and always keeps the most recent `residual_length` tokens in full precision.
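A minimal sketch of the behavior described above, under my reading of the code (this is a hypothetical `ToyQuantizedCache`, not the actual transformers class): once the residual buffer reaches `residual_length`, every token in it is quantized, leaving no full-precision tail.

```python
class ToyQuantizedCache:
    """Toy model of the reported behavior; tokens are plain ints here."""

    def __init__(self, residual_length=4):
        self.residual_length = residual_length
        self.quantized = []   # stands in for the quantized key/value cache
        self.residual = []    # full-precision recent tokens

    def update(self, new_tokens):
        if not self.quantized and not self.residual:
            # Prefill: all prompt tokens are quantized at once.
            self.quantized.extend(new_tokens)
            return
        self.residual.extend(new_tokens)
        if len(self.residual) >= self.residual_length:
            # The whole residual buffer is flushed to the quantized store;
            # no recent tokens stay in full precision.
            self.quantized.extend(self.residual)
            self.residual = []

cache = ToyQuantizedCache(residual_length=4)
cache.update(list(range(8)))   # prefill: everything quantized
for t in range(8, 12):         # decode: residual fills up, then flushes entirely
    cache.update([t])
```

After the loop, the residual buffer is empty: all 12 tokens sit in the quantized store.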
Expected behavior
Always keep the KV cache for the most recent `residual_length` tokens in full precision.
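The expected KIVI-style behavior could be sketched like this (again a hypothetical class, not the transformers API): only tokens older than the last `residual_length` spill into the quantized store, so the most recent tokens always remain full precision.

```python
class SlidingResidualCache:
    """Toy sliding-window residual cache; tokens are plain ints here."""

    def __init__(self, residual_length=4):
        self.residual_length = residual_length
        self.quantized = []   # stands in for the quantized key/value cache
        self.residual = []    # always the last `residual_length` tokens at most

    def update(self, new_tokens):
        self.residual.extend(new_tokens)
        overflow = len(self.residual) - self.residual_length
        if overflow > 0:
            # Quantize only the oldest `overflow` tokens; the recent
            # `residual_length` tokens stay in full precision.
            self.quantized.extend(self.residual[:overflow])
            self.residual = self.residual[overflow:]

cache = SlidingResidualCache(residual_length=4)
cache.update(list(range(8)))   # prefill: last 4 tokens stay full precision
for t in range(8, 12):         # decode: one token quantized per step
    cache.update([t])
```

After the loop, tokens 0-7 are in the quantized store while 8-11 remain in the full-precision residual, which matches the expected behavior.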