thu-nics / qllm-eval

Code Repository of Evaluating Quantized Large Language Models

Does KV cache belong to Activation? #6

Open pprp opened 4 months ago

pprp commented 4 months ago

The survey discusses the sensitivity of activation quantization and the tolerance of KV cache quantization in the context of post-training quantization (PTQ) for large language models (LLMs). It makes the distinction that activation quantization is quite sensitive (it can significantly degrade performance if not handled carefully), whereas KV cache quantization is more tolerant (it can be quantized with less impact on performance).
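(For concreteness, "sensitivity" here refers to the accuracy drop when a tensor is passed through a quantize-dequantize step during PTQ. Below is a minimal sketch of such a fake-quantization step, illustrative only and not the evaluation code used in the paper:)

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor quantization: round to an n_bits
    integer grid, then map back to float (quantize-dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale
```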

My question is: should the KV cache be considered part of the activations?

wln20 commented 4 months ago

Hi!

To be exact, "activations" refers to the "temporary activations" in our paper, which serve as the inputs of the linear operators, while the KV cache comes from the outputs of k_proj and v_proj.
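To make the distinction concrete, here is a simplified single-head attention block in PyTorch marking which tensors are "temporary activations" (inputs of the linear operators) and which become the KV cache (outputs of k_proj / v_proj). The module and variable names are illustrative, not the exact implementation in qllm-eval:

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Illustrative single-head attention; names are hypothetical."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x is a "temporary activation": the input of the linear
        # operators (q_proj / k_proj / v_proj), discarded after use.
        q = self.q_proj(x)
        k = self.k_proj(x)   # output of k_proj -> appended to the KV cache
        v = self.v_proj(x)   # output of v_proj -> appended to the KV cache

        if kv_cache is not None:
            k = torch.cat([kv_cache[0], k], dim=1)
            v = torch.cat([kv_cache[1], v], dim=1)
        kv_cache = (k, v)    # persists across decoding steps

        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = attn @ v       # "out" is again a temporary activation,
                             # the input of o_proj
        return self.o_proj(out), kv_cache
```

The key difference is lifetime: temporary activations exist only within one forward pass, while the KV cache is stored and reused across decoding steps.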

While you may think of both temporary activations and the KV cache as "feature maps" within the model, we empirically found that some of their characteristics differ considerably (including their sensitivity to quantization). So I don't think it's a good idea to treat the KV cache and the temporary activations as the same kind of tensor.

By the way, some recent works have reported findings on the difference between activations and the KV cache that align with our observations. For example, WKVQuant also points out that temporary activations are much more sensitive than the KV cache. Furthermore, KIVI's study of the KV cache's data distribution demonstrates that the outlier patterns of the KV cache and the temporary activations are quite different, so it's not surprising that quantization affects these two kinds of tensors differently.
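A quick way to see the difference in outlier patterns yourself is to compare per-channel magnitude statistics of the two kinds of tensors. The helper below is a hypothetical inspection sketch (tensor names and shapes are assumptions), not code from KIVI or this repository:

```python
import torch

def channel_outlier_profile(t: torch.Tensor) -> torch.Tensor:
    """Ratio of each channel's max magnitude to the median per-channel max.
    Large values mean a few channels carry most of the outliers.
    Assumes the last dimension is the channel/hidden dimension."""
    per_channel_max = t.abs().amax(dim=tuple(range(t.dim() - 1)))
    return per_channel_max / per_channel_max.median().clamp(min=1e-8)

# Hypothetical usage, with hidden_states (a temporary activation) and
# key_cache both shaped (batch, seq, hidden):
# print(channel_outlier_profile(hidden_states).topk(5).values)
# print(channel_outlier_profile(key_cache).topk(5).values)
```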