
R6 #10

Open tanghl1994 opened 2 days ago

tanghl1994 commented 2 days ago

We sincerely appreciate your constructive feedback on our paper, and we will update the revised version to address your concerns. Below are our responses.

Q1: Comparison with [1]: Although [1] reports a finding similar to ours, retaining the KV cache only for induction heads introduces a significant accuracy drop on tasks such as Needle In A Haystack. We introduce two essential strategies that go beyond induction heads to achieve lossless performance:

  1. Expanded Definition of Retrieval Heads: The concept of induction heads was introduced in [1], where these heads were identified as attention heads that follow a token-retrieval pattern (attending to previously seen tokens). In [2], the authors extended this observation to long-context scenarios, showing that induction heads are more critical than other heads in such settings. However, both works treat only induction heads as retrieval heads. Our analysis reveals that echo (copy) heads are equally crucial for retaining model performance. For instance, on the Needle in a Haystack dataset, using the definition of retrieval heads from [1] leads to an accuracy drop of roughly 10%. By redefining retrieval heads to include both induction and echo heads, we retain the full performance of the model (see Figure 5 in our paper).
  2. Compensation Token Strategy: Directly discarding remote tokens in non-retrieval heads results in severe performance degradation, with accuracy drops exceeding 30%. To address this, we designed the compensation token strategy, which condenses the information dropped from these heads into a compact form. This ensures that essential information is preserved while enabling efficient KV cache compression (a brief sketch of both ideas follows this list).
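For concreteness, here is a minimal, hypothetical sketch of the two ideas above: scoring heads as induction vs. echo heads from their attention pattern on a repeated sequence, and folding the evicted KV entries of a non-retrieval head into a single compensation token. The function names, tensor shapes, and the mean-pooling aggregation are illustrative assumptions, not the exact implementation in our paper.

```python
import torch

def retrieval_scores(attn: torch.Tensor, period: int):
    """Score each head for induction vs. echo behavior on a sequence that
    repeats every `period` tokens.

    attn: [num_heads, seq_len, seq_len] attention weights.
    For a query position t (t >= period):
      - an echo (copy) head attends to position t - period      (same token)
      - an induction head attends to position t - period + 1    (token after it)
    """
    num_heads, seq_len, _ = attn.shape
    echo = torch.zeros(num_heads)
    induction = torch.zeros(num_heads)
    for t in range(period, seq_len):
        echo = echo + attn[:, t, t - period]
        induction = induction + attn[:, t, t - period + 1]
    denom = max(seq_len - period, 1)
    return induction / denom, echo / denom

def compress_with_compensation(keys: torch.Tensor, values: torch.Tensor,
                               keep_mask: torch.Tensor):
    """For one non-retrieval head: keep only the KV entries selected by
    `keep_mask` and summarize all evicted entries into one compensation token.

    keys, values: [seq_len, head_dim]; keep_mask: [seq_len] bool.
    Mean pooling is used here purely as an illustrative aggregation.
    """
    kept_k, kept_v = keys[keep_mask], values[keep_mask]
    drop_k, drop_v = keys[~keep_mask], values[~keep_mask]
    if drop_k.shape[0] == 0:
        return kept_k, kept_v
    comp_k = drop_k.mean(dim=0, keepdim=True)  # single compensation key
    comp_v = drop_v.mean(dim=0, keepdim=True)  # single compensation value
    return torch.cat([comp_k, kept_k], dim=0), torch.cat([comp_v, kept_v], dim=0)
```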

Q2: Comparison with [2]. An obvious advantage of a training-free algorithm is that it can be used as a plug-and-play component in advanced LLM serving systems such as TRT-LLM and Triton, and training-free lossless compression is usually much more challenging than training-based approaches. Moreover, we believe the compensation token we introduce and our head selection method (induction + echo heads) provide valuable insights into the underlying functionality of LLMs in long-context scenarios.

Q3: How to quickly determine the suitable proportion of Induction Heads and Echo Heads for different models. We find that for most open-source models, keeping roughly 15%–20% of heads as induction heads plus about 1% as echo heads is sufficient without any training. If a performance gap remains, our recipe is to increase the induction-head budget by about 3% per step. We will add a discussion of this in our revised version; a rough sketch of the selection rule is given below.
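The sketch below illustrates only the budget rule described above (the fractions and the step-wise increase); the scores `induction_score` / `echo_score` could come from a scoring pass like the one sketched under Q1, and all names here are illustrative rather than our released code.

```python
import torch

def select_retrieval_heads(induction_score: torch.Tensor,
                           echo_score: torch.Tensor,
                           induction_frac: float = 0.15,
                           echo_frac: float = 0.01) -> torch.Tensor:
    """Mark the top induction_frac of heads by induction score, plus the top
    echo_frac by echo score, as retrieval heads. These keep their full KV
    cache; all other heads are compressed (e.g., with compensation tokens)."""
    n = induction_score.numel()
    k_ind = max(1, round(induction_frac * n))
    k_echo = max(1, round(echo_frac * n))
    top_ind = torch.topk(induction_score, k_ind).indices
    top_echo = torch.topk(echo_score, k_echo).indices
    return torch.unique(torch.cat([top_ind, top_echo]))

# If a noticeable accuracy gap remains, widen the induction budget in ~3% steps:
# for frac in (0.15, 0.18, 0.21):
#     retrieval_heads = select_retrieval_heads(ind_scores, echo_scores, frac)
```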

Q4: General theoretical analysis explaining why retrieval heads are widely present across models: We agree this is an important and challenging topic, and we leave it for future work.

Molimolinaa commented 1 day ago

perfect