thu-nics / qllm-eval

Code Repository of Evaluating Quantized Large Language Models

Does KV cache belong to Activation? #6

Open pprp opened 4 months ago

pprp commented 4 months ago

The survey discusses the sensitivity of activation quantization and the tolerance of KV cache quantization in the context of post-training quantization (PTQ) for large language models (LLMs). It makes the distinction that activation quantization is quite sensitive (it can significantly degrade performance if not handled carefully), whereas KV cache quantization is more tolerant (it can be quantized with less impact on performance).
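(For concreteness, "sensitivity" here refers to the accuracy drop when a tensor is passed through a quantize-dequantize step during PTQ. Below is a minimal sketch of such a fake-quantization step, illustrative only and not the evaluation code used in the paper:)

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor quantization: round to an n_bits
    integer grid, then map back to float (quantize-dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale
```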

My question is: should the KV cache be considered part of the activations?

wln20 commented 4 months ago

Hi!

To be exact, "activations" refers to the "temporary activations" in our paper, which serve as the inputs of the linear operators, while the KV cache comes from the outputs of k_proj and v_proj.
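To make the distinction concrete, here is a simplified single-head attention block in PyTorch marking which tensors are "temporary activations" (inputs of the linear operators) and which become the KV cache (outputs of k_proj / v_proj). The module and variable names are illustrative, not the exact implementation in qllm-eval:

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Illustrative single-head attention; names are hypothetical."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x is a "temporary activation": the input of the linear
        # operators (q_proj / k_proj / v_proj), discarded after use.
        q = self.q_proj(x)
        k = self.k_proj(x)   # output of k_proj -> appended to the KV cache
        v = self.v_proj(x)   # output of v_proj -> appended to the KV cache

        if kv_cache is not None:
            k = torch.cat([kv_cache[0], k], dim=1)
            v = torch.cat([kv_cache[1], v], dim=1)
        kv_cache = (k, v)    # persists across decoding steps

        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = attn @ v       # "out" is again a temporary activation,
                             # the input of o_proj
        return self.o_proj(out), kv_cache
```

The key difference is lifetime: temporary activations exist only within one forward pass, while the KV cache is stored and reused across decoding steps.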

While you may think of both temporary activations and the KV cache as "feature maps" within the model, we empirically found that some of their characteristics differ considerably (including their sensitivity to quantization). So I don't think it's a good idea to treat the KV cache and the temporary activations as the same kind of tensor.

By the way, some recent works have reported findings on the difference between activations and the KV cache that align with our observations. For example, WKVQuant also points out that temporary activations are much more sensitive than the KV cache. Furthermore, KIVI's study of the KV cache's data distribution demonstrates that the outlier patterns of the KV cache and the temporary activations are quite different, so it's not surprising that quantization affects these two kinds of tensors differently.
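A quick way to see the difference in outlier patterns yourself is to compare per-channel magnitude statistics of the two kinds of tensors. The helper below is a hypothetical inspection sketch (tensor names and shapes are assumptions), not code from KIVI or this repository:

```python
import torch

def channel_outlier_profile(t: torch.Tensor) -> torch.Tensor:
    """Ratio of each channel's max magnitude to the median per-channel max.
    Large values mean a few channels carry most of the outliers.
    Assumes the last dimension is the channel/hidden dimension."""
    per_channel_max = t.abs().amax(dim=tuple(range(t.dim() - 1)))
    return per_channel_max / per_channel_max.median().clamp(min=1e-8)

# Hypothetical usage, with hidden_states (a temporary activation) and
# key_cache both shaped (batch, seq, hidden):
# print(channel_outlier_profile(hidden_states).topk(5).values)
# print(channel_outlier_profile(key_cache).topk(5).values)
```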