tanghl1994 opened 1 day ago
We sincerely appreciate your constructive feedback on our paper, and we will update the revised version to address your concerns. Below are our responses.
Q1: Whether the retrieval pattern holds in general query settings. Thank you for pointing this out. The retrieval patterns remain highly consistent across different input queries, suggesting that they are indeed model-based rather than query-specific. In Section 2 of our updated supplementary, we illustrate the retrieval heads selected using various “Needle in a Haystack” queries, demonstrating that the selected heads are unchanged across different inputs.
Furthermore, we validate these findings on diverse and complex datasets such as LongBench (Table 3) and InfiniteBench/Ruler (results provided below). These benchmarks encompass a wide variety of complex task sources, and RazorAttention successfully retains the original performance. This consistency reinforces the generalizability of the observed retrieval patterns, particularly for lengthy inputs (note that KV cache compression is not applied to inputs shorter than 4k, as described in Table 2).
| InfiniteBench | Original Model | RA Model |
|---|---|---|
| codedebug | 21.57 | 21.57 |
| ensum | 30.71 | 30.58 |
| endia | 19.50 | 19.50 |
| enqa | 29.09 | 29.50 |
| enmc | 63.31 | 63.31 |
| Ruler | Original Model | RA Model |
|---|---|---|
| ruler_niah_single_1_16k | 100 | 100 |
| ruler_niah_single_2_16k | 100 | 100 |
| ruler_niah_single_3_16k | 100 | 100 |
| ruler_niah_multikey_1_16k | 100 | 100 |
| ruler_niah_multikey_2_16k | 99.11 | 99.11 |
| ruler_niah_multikey_3_16k | 99.11 | 99.11 |
| ruler_niah_multivalue_16k | 99.11 | 97.54 |
| ruler_niah_multiquery_16k | 95.09 | 95.31 |
| ruler_vt_16k | 80.54 | 86.61 |
| ruler_fwe_16k | 89.29 | 85.42 |
| ruler_cwe_16k | 90.09 | 90.27 |
| ruler_qa_squad_16k | 88.39 | 88.39 |
| ruler_qa_hotpotqa_16k | 56.25 | 56.25 |
| ruler_16k (avg.) | 92.08 | 92.15 |
| ruler_niah_single_1_32k | 100 | 100 |
| ruler_niah_single_2_32k | 100 | 100 |
| ruler_niah_single_3_32k | 100 | 100 |
| ruler_niah_multikey_1_32k | 100 | 100 |
| ruler_niah_multikey_2_32k | 96.43 | 96.43 |
| ruler_niah_multikey_3_32k | 100 | 100 |
| ruler_niah_multivalue_32k | 97.54 | 94.20 |
| ruler_niah_multiquery_32k | 88.17 | 91.52 |
| ruler_vt_32k | 87.86 | 93.04 |
| ruler_fwe_32k | 88.69 | 88.99 |
| ruler_cwe_32k | 12.41 | 20.45 |
| ruler_qa_squad_32k | 86.61 | 85.71 |
| ruler_qa_hotpotqa_32k | 46.43 | 46.43 |
| ruler_32k (avg.) | 84.93 | 85.91 |
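The head-consistency claim above can be quantified as a set overlap between the retrieval heads selected under different probe queries. The following is a minimal sketch; the head indices are purely illustrative, not the paper's actual measurements:

```python
# Hypothetical sketch: measure how stable the selected retrieval-head set is
# across different "Needle in a Haystack" probe queries. A head is identified
# by a (layer, head) pair; the indices below are made up for illustration.

def retrieval_head_overlap(head_sets):
    """Jaccard overlap: |intersection| / |union| over all queries' head sets."""
    union = set().union(*head_sets)
    inter = set(head_sets[0]).intersection(*head_sets[1:])
    return len(inter) / len(union)

# Heads selected under three different needle queries (illustrative values).
heads_q1 = {(3, 5), (7, 12), (14, 2), (20, 9)}
heads_q2 = {(3, 5), (7, 12), (14, 2), (20, 9)}
heads_q3 = {(3, 5), (7, 12), (14, 2), (20, 9)}

print(retrieval_head_overlap([heads_q1, heads_q2, heads_q3]))  # 1.0 when identical
```

An overlap of 1.0 across queries is what "the selected heads are unchanged across different inputs" corresponds to in this metric.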
Q2: Comparison with SnapKV. We agree that SnapKV is a competitive approach to KV cache compression, especially when the user query is known before generation. Below we report SnapKV's results in both the query-aware and query-agnostic settings:
• Query-aware: SnapKV achieves impressive compression ratios and accuracy when the query is known in advance.
• Query-agnostic: when the user query is not pre-defined, SnapKV's performance deteriorates significantly, as shown by the following Needle in a Haystack results (see Section 1 of our updated supplementary):

| RazorAttention | SnapKV (query-aware) | SnapKV (query-agnostic) |
|---|---|---|
| 98.33 | 100 | 69.75 |
RazorAttention's resilience in the query-agnostic setting stems from its training-free design and headwise sparse pattern, which together ensure minimal information loss across query types. We believe this discussion is important and will incorporate it into the revised version to emphasize the practical advantages of RazorAttention over query-dependent methods such as SnapKV.
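The headwise sparse pattern referred to above can be sketched as a simple cache policy: retrieval heads keep their full KV cache, while the remaining heads keep only a few "sink" tokens plus a recent window. This is a minimal illustration, not the paper's implementation; the parameter names `n_sink` and `n_recent` and all sizes are assumptions for the sketch:

```python
import numpy as np

# Minimal sketch of a headwise KV-cache policy in the spirit of RazorAttention:
# retrieval heads keep the full cache; every other head keeps only the first
# n_sink and last n_recent tokens. Shapes/sizes below are illustrative.

def compress_kv(kv, retrieval_heads, n_sink=4, n_recent=128):
    """kv: dict mapping head index -> KV array of shape (seq_len, head_dim)."""
    out = {}
    for h, cache in kv.items():
        if h in retrieval_heads or cache.shape[0] <= n_sink + n_recent:
            out[h] = cache  # retrieval heads (and short inputs) are untouched
        else:
            out[h] = np.concatenate([cache[:n_sink], cache[-n_recent:]])
    return out

kv = {h: np.zeros((1024, 64)) for h in range(8)}
small = compress_kv(kv, retrieval_heads={2, 5})
print(small[2].shape, small[0].shape)  # retrieval head kept whole, others truncated
```

Because no head's policy depends on the content of the user query, the retained cache is identical whether or not the query is known in advance, which is why the method does not degrade in the query-agnostic setting.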
Here we also include the full results of RazorAttention and SnapKV on LongBench with Llama3.1-8B-instruct, under the same setting as in our paper.
| Dataset | Baseline | RazorAttention | SnapKV |
|---|---|---|---|
| 2wikimqa | 49.25 | 49.81 | 49.12 |
| hotpotqa | 57.61 | 57.22 | 57.60 |
| musique | 33.72 | 32.80 | 32.55 |
| multifieldqa_en | 55.77 | 56.64 | 56.19 |
| multifieldqa_zh | 63.47 | 63.81 | 62.99 |
| narrativeqa | 29.23 | 29.54 | 30.03 |
| qasper | 47.53 | 47.32 | 47.61 |
| triviaqa | 91.50 | 91.20 | 91.50 |
| gov_report | 34.58 | 33.08 | 32.97 |
| qmsum | 25.27 | 25.10 | 25.37 |
| vcsum | 17.28 | 16.99 | 16.49 |
| dureader | 34.88 | 31.91 | 31.64 |
| lcc | 24.68 | 24.62 | 24.64 |
| repobench-p | 25.57 | 25.36 | 25.33 |
| passage_retrieval_en | 99.50 | 100.00 | 99.50 |
| passage_retrieval_zh | 90.45 | 95.98 | 90.45 |
| passage_count | 10.08 | 9.75 | 9.83 |
| trec | 14.50 | 9.25 | 17.67 |
| lsht | 0.00 | 0.00 | 0.00 |
| multi_news | 26.92 | 26.81 | 26.77 |
| samsum | 13.50 | 13.97 | 13.37 |
Q3: The retrieval pattern among different queries. As mentioned in response to Q1, the retrieval patterns are highly consistent across different input queries, indicating they are primarily model-based and not query-specific.
Q4: The effectiveness of the compensation token. In Figure 7 of our main paper, we present ablation studies examining the impact of compensation tokens. Both experiments were conducted with identical KV cache allocations, and the inclusion of compensation tokens resulted in significant accuracy improvements. This evidence highlights the critical role of compensation tokens in mitigating information loss due to KV cache reduction. Exploring the combination of compensation tokens with other baseline algorithms is an exciting direction, and we plan to investigate this in future studies.
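One way to picture the compensation mechanism described above: when a head's middle tokens are dropped, a single extra token summarizing them (here, their mean) is kept in their place. This is a simplified sketch under that assumption, not the paper's exact construction; `n_sink`/`n_recent` and all sizes are illustrative:

```python
import numpy as np

# Simplified sketch of a compensation token: instead of discarding the middle
# of the cache outright, keep one extra token that aggregates the dropped
# entries (mean-pooled here), so some of their information survives.

def truncate_with_compensation(cache, n_sink=4, n_recent=128):
    """cache: array of shape (seq_len, head_dim). Returns the reduced cache."""
    if cache.shape[0] <= n_sink + n_recent:
        return cache  # short inputs are left uncompressed
    dropped = cache[n_sink:-n_recent]
    comp = dropped.mean(axis=0, keepdims=True)  # one compensation token
    return np.concatenate([cache[:n_sink], comp, cache[-n_recent:]])

reduced = truncate_with_compensation(np.ones((1024, 64)))
print(reduced.shape)  # n_sink + 1 compensation token + n_recent rows
```

Comparing this against the same truncation without the `comp` row, at identical total budget, is the shape of the ablation in Figure 7.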
Q5: The KV cache budget for different algorithms. Table 2 outlines the KV cache compression setup for RazorAttention. For a fair comparison, the total compression ratio for each baseline was adjusted to match RazorAttention’s configuration, ensuring consistent evaluation across varying input lengths. However, as the input length increases, the absolute KV cache budget may differ proportionally due to RazorAttention’s dynamic compression strategy.
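The budget-matching arithmetic can be made concrete. Under the (assumed, illustrative) headwise scheme where `n_retrieval` heads keep the full cache and the rest keep `n_sink + n_recent` tokens, the overall retained fraction determines the uniform per-head budget a baseline must be given to match:

```python
# Illustrative budget-matching arithmetic: compute the overall fraction of the
# KV cache retained by a headwise scheme, and the per-head token budget a
# uniform baseline would need to match it. All parameter values are made up.

def razor_ratio(n_heads, n_retrieval, seq_len, n_sink=4, n_recent=128):
    """Fraction of the full KV cache retained under the headwise scheme."""
    kept = n_retrieval * seq_len + (n_heads - n_retrieval) * (n_sink + n_recent)
    return kept / (n_heads * seq_len)

r = razor_ratio(n_heads=32, n_retrieval=4, seq_len=16384)
uniform_budget = r * 16384  # matching per-head token budget for a baseline
print(r, uniform_budget)
```

Note that the ratio depends on `seq_len`: as the input grows, the full-cache retrieval heads dominate the retained budget, which is why the absolute budgets of the matched baselines scale differently with input length.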