tanghl1994 opened 1 day ago
We sincerely appreciate your constructive feedback on our paper, and we will update the revised version to address your concerns. Below are our responses.
Q1: Whether the retrieval pattern holds in general query settings. Thank you for pointing this out. The retrieval patterns remain highly consistent across different input queries, suggesting that they are indeed model-based rather than query-specific. In Section 2 of our updated supplementary, we illustrate the retrieval heads selected using various “Needle in a Haystack” queries, demonstrating that the selected heads are unchanged across different inputs.
Furthermore, we validate these findings on diverse and complex datasets such as LongBench (Table 3) and InfiniteBench/Ruler (results provided below). These benchmarks encompass a wide variety of complex task sources, and RazorAttention successfully retains the original performance. This consistency reinforces the generalizability of the observed retrieval patterns, particularly for lengthy inputs (note that KV cache compression is not applied to inputs shorter than 4k, as described in Table 2).
| InfiniteBench | Original Model | RA Model |
|---|---|---|
| codedebug | 21.57 | 21.57 |
| ensum | 30.71 | 30.58 |
| endia | 19.50 | 19.50 |
| enqa | 29.09 | 29.50 |
| enmc | 63.31 | 63.31 |
| Ruler | Original Model | RA Model |
|---|---|---|
| ruler_niah_single_1_16k | 100 | 100 |
| ruler_niah_single_2_16k | 100 | 100 |
| ruler_niah_single_3_16k | 100 | 100 |
| ruler_niah_multikey_1_16k | 100 | 100 |
| ruler_niah_multikey_2_16k | 99.11 | 99.11 |
| ruler_niah_multikey_3_16k | 99.11 | 99.11 |
| ruler_niah_multivalue_16k | 99.11 | 97.54 |
| ruler_niah_multiquery_16k | 95.09 | 95.31 |
| ruler_vt_16k | 80.54 | 86.61 |
| ruler_fwe_16k | 89.29 | 85.42 |
| ruler_cwe_16k | 90.09 | 90.27 |
| ruler_qa_squad_16k | 88.39 | 88.39 |
| ruler_qa_hotpotqa_16k | 56.25 | 56.25 |
| ruler_16k (avg.) | 92.08 | 92.15 |
| ruler_niah_single_1_32k | 100 | 100 |
| ruler_niah_single_2_32k | 100 | 100 |
| ruler_niah_single_3_32k | 100 | 100 |
| ruler_niah_multikey_1_32k | 100 | 100 |
| ruler_niah_multikey_2_32k | 96.43 | 96.43 |
| ruler_niah_multikey_3_32k | 100 | 100 |
| ruler_niah_multivalue_32k | 97.54 | 94.20 |
| ruler_niah_multiquery_32k | 88.17 | 91.52 |
| ruler_vt_32k | 87.86 | 93.04 |
| ruler_fwe_32k | 88.69 | 88.99 |
| ruler_cwe_32k | 12.41 | 20.45 |
| ruler_qa_squad_32k | 86.61 | 85.71 |
| ruler_qa_hotpotqa_32k | 46.43 | 46.43 |
| ruler_32k (avg.) | 84.93 | 85.91 |
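The head-consistency claim above can be quantified as a set overlap between the retrieval heads selected under different probe queries. The following is a minimal sketch; the head indices are purely illustrative, not the paper's actual measurements:

```python
# Hypothetical sketch: measure how stable the selected retrieval-head set is
# across different "Needle in a Haystack" probe queries. A head is identified
# by a (layer, head) pair; the indices below are made up for illustration.

def retrieval_head_overlap(head_sets):
    """Jaccard overlap: |intersection| / |union| over all queries' head sets."""
    union = set().union(*head_sets)
    inter = set(head_sets[0]).intersection(*head_sets[1:])
    return len(inter) / len(union)

# Heads selected under three different needle queries (illustrative values).
heads_q1 = {(3, 5), (7, 12), (14, 2), (20, 9)}
heads_q2 = {(3, 5), (7, 12), (14, 2), (20, 9)}
heads_q3 = {(3, 5), (7, 12), (14, 2), (20, 9)}

print(retrieval_head_overlap([heads_q1, heads_q2, heads_q3]))  # 1.0 when identical
```

An overlap of 1.0 across queries is what "the selected heads are unchanged across different inputs" corresponds to in this metric.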
Q2: Comparison with SnapKV. We agree that SnapKV is a competitive approach to KV cache compression, especially when the user query is known before generation. Below we report SnapKV's results in both the query-aware and query-agnostic settings:
• Query-aware: SnapKV achieves impressive compression ratios and accuracy when the query is known in advance.
• Query-agnostic: when the user query is not pre-defined, SnapKV's performance deteriorates significantly, as shown by the following Needle in a Haystack results (see Section 1 of our updated supplementary):

| RazorAttention | SnapKV (query-aware) | SnapKV (query-agnostic) |
|---|---|---|
| 98.33 | 100 | 69.75 |
RazorAttention's resilience in the query-agnostic setting stems from its training-free design and headwise sparse pattern, which together ensure minimal information loss across query types. We believe this discussion is important and will incorporate it into the revised version to emphasize the practical advantages of RazorAttention over query-dependent methods such as SnapKV.
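The headwise sparse pattern referred to above can be sketched as a simple cache policy: retrieval heads keep their full KV cache, while the remaining heads keep only a few "sink" tokens plus a recent window. This is a minimal illustration, not the paper's implementation; the parameter names `n_sink` and `n_recent` and all sizes are assumptions for the sketch:

```python
import numpy as np

# Minimal sketch of a headwise KV-cache policy in the spirit of RazorAttention:
# retrieval heads keep the full cache; every other head keeps only the first
# n_sink and last n_recent tokens. Shapes/sizes below are illustrative.

def compress_kv(kv, retrieval_heads, n_sink=4, n_recent=128):
    """kv: dict mapping head index -> KV array of shape (seq_len, head_dim)."""
    out = {}
    for h, cache in kv.items():
        if h in retrieval_heads or cache.shape[0] <= n_sink + n_recent:
            out[h] = cache  # retrieval heads (and short inputs) are untouched
        else:
            out[h] = np.concatenate([cache[:n_sink], cache[-n_recent:]])
    return out

kv = {h: np.zeros((1024, 64)) for h in range(8)}
small = compress_kv(kv, retrieval_heads={2, 5})
print(small[2].shape, small[0].shape)  # retrieval head kept whole, others truncated
```

Because no head's policy depends on the content of the user query, the retained cache is identical whether or not the query is known in advance, which is why the method does not degrade in the query-agnostic setting.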
Here we also include the full results of RazorAttention and SnapKV on LongBench with Llama3.1-8B-instruct, under the same setting as in our paper.
| Dataset | Baseline | RazorAttention | SnapKV |
|---|---|---|---|
| 2wikimqa | 49.25 | 49.81 | 49.12 |
| hotpotqa | 57.61 | 57.22 | 57.60 |
| musique | 33.72 | 32.80 | 32.55 |
| multifieldqa_en | 55.77 | 56.64 | 56.19 |
| multifieldqa_zh | 63.47 | 63.81 | 62.99 |
| narrativeqa | 29.23 | 29.54 | 30.03 |
| qasper | 47.53 | 47.32 | 47.61 |
| triviaqa | 91.50 | 91.20 | 91.50 |
| gov_report | 34.58 | 33.08 | 32.97 |
| qmsum | 25.27 | 25.10 | 25.37 |
| vcsum | 17.28 | 16.99 | 16.49 |
| dureader | 34.88 | 31.91 | 31.64 |
| lcc | 24.68 | 24.62 | 24.64 |
| repobench-p | 25.57 | 25.36 | 25.33 |
| passage_retrieval_en | 99.50 | 100.00 | 99.50 |
| passage_retrieval_zh | 90.45 | 95.98 | 90.45 |
| passage_count | 10.08 | 9.75 | 9.83 |
| trec | 14.50 | 9.25 | 17.67 |
| lsht | 0.00 | 0.00 | 0.00 |
| multi_news | 26.92 | 26.81 | 26.77 |
| samsum | 13.50 | 13.97 | 13.37 |
Q3: The retrieval pattern among different queries. As mentioned in response to Q1, the retrieval patterns are highly consistent across different input queries, indicating they are primarily model-based and not query-specific.
Q4: The effectiveness of the compensation token. In Figure 7 of our main paper, we present ablation studies examining the impact of compensation tokens. Both experiments were conducted with identical KV cache allocations, and the inclusion of compensation tokens resulted in significant accuracy improvements. This evidence highlights the critical role of compensation tokens in mitigating information loss due to KV cache reduction. Exploring the combination of compensation tokens with other baseline algorithms is an exciting direction, and we plan to investigate this in future studies.
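One way to picture the compensation mechanism described above: when a head's middle tokens are dropped, a single extra token summarizing them (here, their mean) is kept in their place. This is a simplified sketch under that assumption, not the paper's exact construction; `n_sink`/`n_recent` and all sizes are illustrative:

```python
import numpy as np

# Simplified sketch of a compensation token: instead of discarding the middle
# of the cache outright, keep one extra token that aggregates the dropped
# entries (mean-pooled here), so some of their information survives.

def truncate_with_compensation(cache, n_sink=4, n_recent=128):
    """cache: array of shape (seq_len, head_dim). Returns the reduced cache."""
    if cache.shape[0] <= n_sink + n_recent:
        return cache  # short inputs are left uncompressed
    dropped = cache[n_sink:-n_recent]
    comp = dropped.mean(axis=0, keepdims=True)  # one compensation token
    return np.concatenate([cache[:n_sink], comp, cache[-n_recent:]])

reduced = truncate_with_compensation(np.ones((1024, 64)))
print(reduced.shape)  # n_sink + 1 compensation token + n_recent rows
```

Comparing this against the same truncation without the `comp` row, at identical total budget, is the shape of the ablation in Figure 7.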
Q5: The KV cache budget for different algorithms. Table 2 outlines the KV cache compression setup for RazorAttention. For a fair comparison, the total compression ratio for each baseline was adjusted to match RazorAttention’s configuration, ensuring consistent evaluation across varying input lengths. However, as the input length increases, the absolute KV cache budget may differ proportionally due to RazorAttention’s dynamic compression strategy.
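The budget-matching arithmetic can be made concrete. Under the (assumed, illustrative) headwise scheme where `n_retrieval` heads keep the full cache and the rest keep `n_sink + n_recent` tokens, the overall retained fraction determines the uniform per-head budget a baseline must be given to match:

```python
# Illustrative budget-matching arithmetic: compute the overall fraction of the
# KV cache retained by a headwise scheme, and the per-head token budget a
# uniform baseline would need to match it. All parameter values are made up.

def razor_ratio(n_heads, n_retrieval, seq_len, n_sink=4, n_recent=128):
    """Fraction of the full KV cache retained under the headwise scheme."""
    kept = n_retrieval * seq_len + (n_heads - n_retrieval) * (n_sink + n_recent)
    return kept / (n_heads * seq_len)

r = razor_ratio(n_heads=32, n_retrieval=4, seq_len=16384)
uniform_budget = r * 16384  # matching per-head token budget for a baseline
print(r, uniform_budget)
```

Note that the ratio depends on `seq_len`: as the input grows, the full-cache retrieval heads dominate the retained budget, which is why the absolute budgets of the matched baselines scale differently with input length.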