
R5 #9

tanghl1994 opened this issue 2 days ago

Molimolinaa commented 2 days ago

We sincerely appreciate your constructive feedback on our paper; we will update the revised version to address your concerns. Below are our responses.

Q1: Novelty of RazorAttention.

Our work is the first to apply induction heads to KV cache compression, and we introduce two essential strategies that go beyond induction heads alone to achieve lossless performance:

  1. Expanded Definition of Retrieval Heads: The concept of induction heads was introduced in [1], which identified attention heads that follow a token-retrieval pattern (attending to previously seen tokens). In [2], the authors extended this observation to long-context scenarios and showed that induction heads are more critical than other heads in such settings. However, both works treated only induction heads as retrieval heads. Our analysis reveals that echo heads are equally crucial for retaining model performance: on the Needle in a Haystack dataset, using the definition of retrieval heads from [1] leads to an accuracy drop of roughly 10%. By redefining retrieval heads to include both induction and echo heads, we retain the full performance of the model (see Figure 5 in our paper; a toy identification sketch follows this list).
  2. Compensation Token Strategy: Directly discarding remote tokens in non-retrieval heads causes severe performance degradation, with accuracy drops exceeding 30%. To address this, we designed the compensation token strategy, which condenses the information dropped from these heads into a compact form, preserving essential information while still enabling efficient KV cache compression (a second sketch below illustrates this idea).
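
To make the head taxonomy concrete, here is a minimal NumPy sketch of how echo and induction heads could be identified from a probe prompt whose random token segment is repeated twice; the threshold `tau`, the probe construction, and the function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def head_scores(attn, seg_len):
    """Score one attention head on a probe prompt of the form
    [random segment, same segment again], so T = 2 * seg_len.

    attn: (T, T) causal attention matrix for this head (rows = queries).
    For each query in the second copy of the segment:
      - an echo head puts its mass on the identical token one copy back,
      - an induction head puts its mass on the token right after that
        previous occurrence.
    """
    echo = [attn[q, q - seg_len] for q in range(seg_len, 2 * seg_len)]
    induction = [attn[q, q - seg_len + 1] for q in range(seg_len, 2 * seg_len)]
    return float(np.mean(echo)), float(np.mean(induction))

def classify_heads(all_attn, seg_len, tau=0.3):
    """all_attn: (num_heads, T, T). Returns indices of heads whose mean
    echo / induction attention exceeds the (assumed) threshold tau."""
    echo_heads, induction_heads = [], []
    for h in range(all_attn.shape[0]):
        e, i = head_scores(all_attn[h], seg_len)
        if e > tau:
            echo_heads.append(h)
        if i > tau:
            induction_heads.append(h)
    return echo_heads, induction_heads
```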
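
And a minimal sketch of the compensation-token idea for a single non-retrieval head: remote keys/values are replaced by one averaged entry kept in front of the recent-token cache. The `keep_recent` buffer size and the mean-pooling scheme here are assumptions for illustration.

```python
import numpy as np

def compress_kv(keys, values, keep_recent=128):
    """Condense the KV cache of one non-retrieval head.

    keys, values: (T, d) cached keys / values for this head. Remote
    entries (everything before the last `keep_recent` tokens) are
    dropped and replaced by a single 'compensation token' holding
    their mean key and mean value, so the discarded information is
    kept in compact form instead of being thrown away entirely.
    """
    if keys.shape[0] <= keep_recent:
        return keys, values  # nothing to drop yet
    remote_k, recent_k = keys[:-keep_recent], keys[-keep_recent:]
    remote_v, recent_v = values[:-keep_recent], values[-keep_recent:]
    comp_k = remote_k.mean(axis=0, keepdims=True)  # (1, d) compensation key
    comp_v = remote_v.mean(axis=0, keepdims=True)  # (1, d) compensation value
    return (np.concatenate([comp_k, recent_k], axis=0),
            np.concatenate([comp_v, recent_v], axis=0))
```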

Q2: More comparisons.

We also evaluated the accuracy of SnapKV at the same compression rate on LongBench with the Llama3.1-8B-Instruct model.

| Task | Baseline | RazorAttention | SnapKV |
| --- | --- | --- | --- |
| 2wikimqa | 49.25 | 49.81 | 49.12 |
| hotpotqa | 57.61 | 57.22 | 57.6 |
| musique | 33.72 | 32.8 | 32.55 |
| multifieldqa_en | 55.77 | 56.64 | 56.19 |
| multifieldqa_zh | 63.47 | 63.81 | 62.99 |
| narrativeqa | 29.23 | 29.54 | 30.03 |
| qasper | 47.53 | 47.32 | 47.61 |
| triviaqa | 91.5 | 91.2 | 91.5 |
| gov_report | 34.58 | 33.08 | 32.97 |
| qmsum | 25.27 | 25.1 | 25.37 |
| vcsum | 17.28 | 16.99 | 16.49 |
| dureader | 34.88 | 31.91 | 31.64 |
| lcc | 24.68 | 24.62 | 24.64 |
| repobench-p | 25.57 | 25.36 | 25.33 |
| passage_retrieval_en | 99.5 | 100 | 99.5 |
| passage_retrieval_zh | 90.45 | 95.98 | 90.45 |
| passage_count | 10.08 | 9.75 | 9.83 |
| trec | 14.5 | 9.25 | 17.67 |
| lsht | 0 | 0 | 0 |
| multi_news | 26.92 | 26.81 | 26.77 |
| samsum | 13.5 | 13.97 | 13.37 |

Q3: Analysis of the worst cases.

We appreciate your observation regarding the benchmarks where RazorAttention does not achieve the best performance. Our analysis shows that most of the challenging cases occur in summarization tasks. The performance loss there can be attributed to the model's tendency to produce shorter answers under RazorAttention: summarization tasks are often scored with metrics such as F1, which heavily penalize shorter responses, so this behavior leads to lower scores (see the worked example below).
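
To illustrate the short-answer penalty concretely, here is token-level F1 on a toy example; the strings and numbers are made up, and real benchmarks may compute the metric slightly differently.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over
    overlapping tokens. A short but correct answer gets high precision
    yet low recall, so the overall score drops."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the committee approved the budget after a long debate on spending"
full = "the committee approved the budget after a long debate on spending"
short = "the committee approved the budget"
print(token_f1(full, reference))   # 1.0
print(token_f1(short, reference))  # 0.625: precision 1.0, recall 5/11
```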

Q4: What is the compression ratio for GQA in the Llama3-8B-Instruct model? And what is the compression ratio for a 1024K sequence length?

For GQA models, we directly select 15% of the KV groups in the model (a group's score is the sum of the scores of the heads within that group), so the compression ratio is the same as for MHA models. A sketch of this selection step follows.
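
A minimal sketch of this group-selection rule, assuming per-head retrieval scores have already been measured; the function name, shapes, and the flat top-k selection are illustrative assumptions.

```python
import numpy as np

def select_retrieval_groups(head_scores, heads_per_group, budget=0.15):
    """Pick the top `budget` fraction of GQA groups per the rule above.

    head_scores: (num_layers, num_heads) retrieval score of each head.
    A group's score is the sum of the scores of its heads; the
    top-scoring 15% of all (layer, group) pairs keep their full KV
    cache, so the compression ratio matches the MHA setting.
    """
    L, H = head_scores.shape
    G = H // heads_per_group
    group_scores = head_scores.reshape(L, G, heads_per_group).sum(axis=-1)
    k = max(1, int(round(budget * L * G)))
    top = np.argsort(group_scores.flatten())[-k:]    # indices of top-k groups
    return [(int(i // G), int(i % G)) for i in top]  # (layer, group) pairs

# e.g. Llama3-8B: 32 layers, 32 query heads in 8 KV groups (4 heads/group)
scores = np.random.rand(32, 32)
print(len(select_retrieval_groups(scores, heads_per_group=4)))  # ~38 groups
```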

Q5: Different combinations of the echo heads and induction heads.

Yes, the performance of RazorAttention under different head budgets can be found in Figure 5 and Table 4 of our paper. We observed that including only 1% of the echo heads already improves accuracy substantially, whereas about 15% of the induction heads are needed to fully recover the performance of the original model.

Molimolinaa commented 1 day ago

Please supplement Q4; the rest looks OK.