huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

The behavior of QuantizedCache is not consistent with KIVI #34189

Open ThisisBillhe opened 1 day ago

ThisisBillhe commented 1 day ago

System Info

transformers 4.45.2

Who can help?

No response

Information

Tasks

Reproduction

As defined in L693-L719 in transformers/cache_utils.py, QuantizedCache quantizes all tokens in the prefill phase (L693-L698) without keeping the most recent tokens in full precision (residual_length). Likewise, during the decoding phase, once self.key_cache[layer_idx].shape[-2] + 1 >= self.residual_length, all tokens are quantized without exception. I believe KIVI uses a sliding window and always keeps the most recent residual_length tokens in full precision.
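A minimal sketch of the decode-time behavior described above (this is not the library source; `residual_buffer` and the `quantize`/`dequantize` callables are simplified stand-ins for the internals of `QuantizedCache.update`):

```python
import torch

def update_as_described(residual_buffer, quantized_cache, new_kv, residual_length,
                        quantize, dequantize):
    """Paraphrase of the reported behavior: once the full-precision residual buffer
    reaches residual_length, everything (previously quantized part + residual buffer
    + new token) is re-quantized and the buffer is emptied, so no tokens remain
    in full precision."""
    if residual_buffer.shape[-2] + 1 >= residual_length:
        full = torch.cat([dequantize(quantized_cache), residual_buffer, new_kv], dim=-2)
        quantized_cache = quantize(full)               # all tokens quantized, none kept in fp16/bf16
        residual_buffer = residual_buffer[..., :0, :]  # buffer reset to empty
    else:
        residual_buffer = torch.cat([residual_buffer, new_kv], dim=-2)
    return residual_buffer, quantized_cache
```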

Expected behavior

Always keep the KV cache for the most recent residual_length tokens in full precision.
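For contrast, a hedged sketch of the expected KIVI-style sliding window; the `concat_quantized` helper is hypothetical and stands in for KIVI's ability to concatenate quantized tensors:

```python
import torch

def update_sliding_window(residual_buffer, quantized_cache, new_kv, residual_length,
                          quantize, concat_quantized):
    """Expected behavior: only tokens that fall out of the most recent
    residual_length window get quantized; the window itself always stays
    in full precision."""
    residual_buffer = torch.cat([residual_buffer, new_kv], dim=-2)
    if residual_buffer.shape[-2] > residual_length:
        overflow = residual_buffer[..., :-residual_length, :]         # oldest tokens leave the window
        quantized_cache = concat_quantized(quantized_cache, quantize(overflow))
        residual_buffer = residual_buffer[..., -residual_length:, :]  # recent tokens stay full precision
    return residual_buffer, quantized_cache
```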

zucchini-nlp commented 1 day ago

Hey @ThisisBillhe !

Yes, the current implementation is not really the same as KIVI. As you noted, we don't have a sliding window, and we don't use the same quantization as the folks from KIVI. KIVI supports concatenating quantized tensors and doing matmuls on them, while what we do is simply quantize using any open library as a backend, and then dequantize back before concatenating with the full-precision key/values.

The quantized cache in transformers is rather inspired by the idea from the KIVI paper, so it's not a full copy. At the same time it gives users the flexibility to swap out the _quantize and _dequantize methods to try out different quantization frameworks/methods/params. Improving the cache in terms of latency was an option, but unfortunately I am short on bandwidth. Feel free to open a PR to make the cache store key/values in a sliding-window fashion, if you want to give it a try 😄
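If someone picks this up, a possible starting point is the extension point mentioned above: subclass the cache and override `_quantize`/`_dequantize`. The method names come from the comment; the exact base-class signatures may differ between versions, and `my_backend_quantize`/`my_backend_dequantize` are hypothetical placeholders for whichever backend you want to try:

```python
from transformers.cache_utils import QuantizedCache

class MyBackendQuantizedCache(QuantizedCache):
    """Sketch of swapping in a different quantization backend by overriding the
    two hooks named in the comment above."""

    def _quantize(self, tensor, axis):
        # Hypothetical backend call: quantize `tensor` along `axis` with your library of choice.
        return my_backend_quantize(tensor, axis=axis)

    def _dequantize(self, q_tensor):
        # Must return a full-precision tensor so it can be concatenated with new key/values.
        return my_backend_dequantize(q_tensor)
```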