tanghl1994 opened 2 days ago
We sincerely appreciate your constructive feedback on our paper, and we will update our revised version to address your concerns. Below are our responses.
Q1: Novelty of RazorAttention.
Our work is the first to apply induction heads to KV cache compression, and we introduce two essential strategies that go beyond the use of induction heads to achieve lossless performance:
Q2: More comparisons.
We also evaluated SnapKV at the same compression rate on LongBench with the Llama3.1-8B-Instruct model; the results are shown in the table below.
Dataset | Baseline | RazorAttention | SnapKV
---|---|---|---
2wikimqa | 49.25 | 49.81 | 49.12 |
hotpotqa | 57.61 | 57.22 | 57.6 |
musique | 33.72 | 32.8 | 32.55 |
multifieldqa_en | 55.77 | 56.64 | 56.19 |
multifieldqa_zh | 63.47 | 63.81 | 62.99 |
narrativeqa | 29.23 | 29.54 | 30.03 |
qasper | 47.53 | 47.32 | 47.61 |
triviaqa | 91.5 | 91.2 | 91.5 |
gov_report | 34.58 | 33.08 | 32.97 |
qmsum | 25.27 | 25.1 | 25.37 |
vcsum | 17.28 | 16.99 | 16.49 |
dureader | 34.88 | 31.91 | 31.64 |
lcc | 24.68 | 24.62 | 24.64 |
repobench-p | 25.57 | 25.36 | 25.33 |
passage_retrieval_en | 99.5 | 100 | 99.5 |
passage_retrieval_zh | 90.45 | 95.98 | 90.45 |
passage_count | 10.08 | 9.75 | 9.83 |
trec | 14.5 | 9.25 | 17.67 |
lsht | 0 | 0 | 0 |
multi_news | 26.92 | 26.81 | 26.77 |
samsum | 13.5 | 13.97 | 13.37
Q3: Analysis of the cases where RazorAttention underperforms.
We appreciate your observation regarding the benchmarks where RazorAttention does not achieve the best performance. From our analysis, we found that most of the challenging cases occur in summarization tasks. The performance loss in these tasks can be attributed to the model’s tendency to produce shorter answers when using RazorAttention. Since summarization tasks often rely on metrics like F1 score, which heavily penalize shorter responses, this behavior leads to lower scores.
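To make this effect concrete, below is a minimal token-level F1 sketch (the actual benchmark scorer also applies text normalization, which is omitted here; the example answers are made up). A shorter answer keeps precision at 1.0 but loses recall, so the F1 score drops even though everything it says is correct.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap; shorter predictions lose recall even when
    every predicted token appears in the reference."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the committee approved the new budget after a long debate"
full_answer = "the committee approved the new budget after a long debate"
short_answer = "the committee approved the budget"

print(token_f1(full_answer, reference))   # 1.00
print(token_f1(short_answer, reference))  # ~0.67: precision is 1.0 but recall drops to 0.5
```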
Q4: What is the compression ratio for GQA in the Llama3-8B-Instruct model? What is the compression ratio for a 1024K sequence length?
For GQA models, we directly select 15% of the KV groups in the model (the score of each group is the sum of the scores of the heads within that group), so the compression ratio is the same as for MHA models.
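For clarity, here is a minimal sketch of this group-level selection (the function name, tensor names, and the `top_ratio` parameter are illustrative assumptions, not our exact implementation):

```python
import torch

def select_retained_groups(head_scores: torch.Tensor,
                           heads_per_group: int,
                           top_ratio: float = 0.15) -> torch.Tensor:
    """Pick the KV groups whose full cache is retained in a GQA layer.

    head_scores: per-head retrieval scores for one layer, shape [num_heads].
    The score of a group is the sum of the scores of the heads sharing that KV head.
    """
    num_groups = head_scores.numel() // heads_per_group
    # Sum head scores within each group, as described above.
    group_scores = head_scores.view(num_groups, heads_per_group).sum(dim=-1)
    # Keep the top 15% of groups (at least one).
    k = max(1, int(round(top_ratio * num_groups)))
    return torch.topk(group_scores, k).indices

# Llama3-8B-Instruct style layer: 32 query heads sharing 8 KV heads (group size 4).
scores = torch.rand(32)
retained = select_retained_groups(scores, heads_per_group=4)
```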
Q5: Different combinations of the echo heads and induction heads.
Yes, the performance of RazorAttention under different head budgets can be found in Figure 5 and Table 4 of our paper. We observed that designating only 1% of the heads as echo heads already yields a substantial accuracy improvement, while roughly 15% induction heads are needed to fully recover the performance of the original model.
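As a rough illustration of how the two head budgets combine (the scoring tensors, head counts, and the helper below are illustrative assumptions, not our released code), the retrieval heads are simply the union of the top-scoring echo heads and induction heads:

```python
import torch

def pick_retrieval_heads(echo_scores: torch.Tensor,
                         induction_scores: torch.Tensor,
                         echo_ratio: float = 0.01,
                         induction_ratio: float = 0.15) -> torch.Tensor:
    """Union of the top-scoring echo heads and induction heads.
    These retrieval heads keep the full KV cache; the remaining heads can be compressed."""
    n = echo_scores.numel()
    echo_heads = torch.topk(echo_scores, max(1, int(round(echo_ratio * n)))).indices
    induction_heads = torch.topk(induction_scores, max(1, int(round(induction_ratio * n)))).indices
    return torch.unique(torch.cat([echo_heads, induction_heads]))

# Hypothetical model with 1024 attention heads in total.
retrieval_heads = pick_retrieval_heads(torch.rand(1024), torch.rand(1024))
```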