tanghl1994 / rebutall_huazi


R7 #11

Open tanghl1994 opened 1 day ago


Molimolinaa commented 1 day ago

We sincerely appreciate your constructive feedback on our paper, and we will update the revised version to address your concerns. Below are our responses.

Q1: Whether the retrieval pattern holds in general query settings. Thank you for pointing this out. The retrieval patterns remain highly consistent across different input queries, suggesting that they are indeed model-based rather than query-specific. In Section 2 of our updated supplementary material, we illustrate the retrieval heads selected using various “Needle in a Haystack” queries, demonstrating that the selected heads are unchanged across different inputs.

Furthermore, we validate these findings on diverse and complex datasets such as LongBench (Table 3) and InfiniBench/Ruler (results provided below). These benchmarks encompass a wide variety of complex task sources, and RazorAttention successfully retains the original performance. This consistency reinforces the generalizability of the observed retrieval patterns, particularly for lengthy inputs (noting that KV cache compression is not applied to inputs shorter than 4k, as described in Table 2).
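To make the probe concrete, below is a minimal sketch of how retrieval heads can be scored with a “Needle in a Haystack” query and checked for consistency across queries. This is illustrative pseudocode, not our released implementation; all function names, tensor shapes, and the 0.1 threshold are hypothetical.

```python
# Minimal sketch of needle-based retrieval-head scoring; names, shapes,
# and the threshold are illustrative, not our released code.
import torch

def retrieval_scores(attn, needle_span):
    """attn: [layers, heads, answer_steps, seq_len] attention weights
    recorded while the model generates the answer.
    needle_span: (start, end) indices of the needle in the context.
    Returns a [layers, heads] score: mean attention mass on the needle."""
    start, end = needle_span
    return attn[..., start:end].sum(dim=-1).mean(dim=-1)

def consistent_retrieval_heads(scores_per_query, threshold=0.1):
    """Keep only heads that exceed the threshold for *every* probe query,
    which is how consistency across queries can be checked."""
    stacked = torch.stack(scores_per_query)      # [queries, layers, heads]
    consistent = (stacked > threshold).all(dim=0)
    return consistent.nonzero(as_tuple=False)    # (layer, head) index pairs
```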

| InfiniteBench | Original Model | RA Model |
| --- | --- | --- |
| codedebug | 21.57 | 21.57 |
| ensum | 30.71 | 30.58 |
| endia | 19.50 | 19.50 |
| enqa | 29.09 | 29.50 |
| enmc | 63.31 | 63.31 |

| Ruler | Original Model | RA Model |
| --- | --- | --- |
| ruler_niah_single_1_16k | 100 | 100 |
| ruler_niah_single_2_16k | 100 | 100 |
| ruler_niah_single_3_16k | 100 | 100 |
| ruler_niah_multikey_1_16k | 100 | 100 |
| ruler_niah_multikey_2_16k | 99.11 | 99.11 |
| ruler_niah_multikey_3_16k | 99.11 | 99.11 |
| ruler_niah_multivalue_16k | 99.11 | 97.54 |
| ruler_niah_multiquery_16k | 95.09 | 95.31 |
| ruler_vt_16k | 80.54 | 86.61 |
| ruler_fwe_16k | 89.29 | 85.42 |
| ruler_cwe_16k | 90.09 | 90.27 |
| ruler_qa_squad_16k | 88.39 | 88.39 |
| ruler_qa_hotpotqa_16k | 56.25 | 56.25 |
| ruler_16k (average) | 92.08 | 92.15 |
| ruler_niah_single_1_32k | 100 | 100 |
| ruler_niah_single_2_32k | 100 | 100 |
| ruler_niah_single_3_32k | 100 | 100 |
| ruler_niah_multikey_1_32k | 100 | 100 |
| ruler_niah_multikey_2_32k | 96.43 | 96.43 |
| ruler_niah_multikey_3_32k | 100 | 100 |
| ruler_niah_multivalue_32k | 97.54 | 94.20 |
| ruler_niah_multiquery_32k | 88.17 | 91.52 |
| ruler_vt_32k | 87.86 | 93.04 |
| ruler_fwe_32k | 88.69 | 88.99 |
| ruler_cwe_32k | 12.41 | 20.45 |
| ruler_qa_squad_32k | 86.61 | 85.71 |
| ruler_qa_hotpotqa_32k | 46.43 | 46.43 |
| ruler_32k (average) | 84.93 | 85.91 |

Q2: Comparison with SnapKV. We agree that SnapKV is a competitive approach for KV cache compression, especially when user queries are known before generation. Below, we present the results of SnapKV in both query-aware and query-agnostic settings:

- Query-aware: SnapKV demonstrates impressive compression ratios and accuracy when the query is known in advance.
- Query-agnostic: when the user query is not pre-defined, SnapKV's performance deteriorates significantly, as shown by the following Needle in a Haystack results (see Section 1 of our updated supplementary material):

| RazorAttention | SnapKV (query-aware) | SnapKV (query-agnostic) |
| --- | --- | --- |
| 98.33 | 100 | 69.75 |

RazorAttention's resilience, by contrast, is due to its training-free design and headwise sparse pattern, which ensure minimal information loss across various query types. We believe this discussion is crucial and will incorporate it into the revised version to emphasize the practical advantages of RazorAttention over query-dependent methods like SnapKV.
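To make the query-aware/query-agnostic distinction concrete, here is a minimal sketch of a SnapKV-style selection step. This is our paraphrase with hypothetical names and shapes, not SnapKV's released code: past tokens are scored by the attention they receive from a trailing observation window, which is only informative when the question already sits inside the prompt.

```python
# Sketch of a SnapKV-style selection step (our paraphrase, hypothetical
# names): past tokens are kept according to attention from a trailing
# observation window at the end of the prompt.
import torch

def snapkv_keep_indices(attn_window, budget):
    """attn_window: [heads, window, seq_len] attention paid by the last
    `window` prompt tokens. Returns the `budget` token indices to keep."""
    votes = attn_window.sum(dim=(0, 1))          # aggregate over heads/window
    return votes.topk(budget).indices.sort().values

# Query-aware: the question sits inside the observation window, so the
# votes concentrate on the relevant evidence. Query-agnostic: the window
# sees only generic context, the votes are flat, and evidence needed by a
# later question may already have been evicted.
```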

Here we also include the full results of RazorAttention and SnapKV on LongBench with Llama3.1-8B-Instruct, under the same settings as in our paper.

| Dataset | Baseline | RazorAttention | SnapKV |
| --- | --- | --- | --- |
| 2wikimqa | 49.25 | 49.81 | 49.12 |
| hotpotqa | 57.61 | 57.22 | 57.60 |
| musique | 33.72 | 32.80 | 32.55 |
| multifieldqa_en | 55.77 | 56.64 | 56.19 |
| multifieldqa_zh | 63.47 | 63.81 | 62.99 |
| narrativeqa | 29.23 | 29.54 | 30.03 |
| qasper | 47.53 | 47.32 | 47.61 |
| triviaqa | 91.50 | 91.20 | 91.50 |
| gov_report | 34.58 | 33.08 | 32.97 |
| qmsum | 25.27 | 25.10 | 25.37 |
| vcsum | 17.28 | 16.99 | 16.49 |
| dureader | 34.88 | 31.91 | 31.64 |
| lcc | 24.68 | 24.62 | 24.64 |
| repobench-p | 25.57 | 25.36 | 25.33 |
| passage_retrieval_en | 99.50 | 100.00 | 99.50 |
| passage_retrieval_zh | 90.45 | 95.98 | 90.45 |
| passage_count | 10.08 | 9.75 | 9.83 |
| trec | 14.50 | 9.25 | 17.67 |
| lsht | 0.00 | 0.00 | 0.00 |
| multi_news | 26.92 | 26.81 | 26.77 |
| samsum | 13.50 | 13.97 | 13.37 |

Q3: The retrieval pattern among different queries. As mentioned in response to Q1, the retrieval patterns are highly consistent across different input queries, indicating they are primarily model-based and not query-specific.

Q4: The effectiveness of the compensation token. In Figure 7 of our main paper, we present ablation studies examining the impact of compensation tokens. Both experiments were conducted with identical KV cache allocations, and the inclusion of compensation tokens resulted in significant accuracy improvements. This evidence highlights the critical role of compensation tokens in mitigating information loss due to KV cache reduction. Exploring the combination of compensation tokens with other baseline algorithms is an exciting direction, and we plan to investigate this in future studies.
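As a minimal sketch of the idea for one non-retrieval head (hypothetical names and window size; the paper gives the exact construction), the head keeps its recent tokens and folds all dropped key/value pairs into a single averaged compensation token:

```python
# Sketch of the compensation-token idea for one non-retrieval head
# (hypothetical names; see the paper for the exact construction).
import torch

def compress_with_compensation(k, v, keep_recent=4096):
    """k, v: [seq_len, head_dim] cache of a single head. Keeps the last
    `keep_recent` tokens and folds everything older into one token."""
    if k.shape[0] <= keep_recent:                # short input: no compression
        return k, v
    k_comp = k[:-keep_recent].mean(dim=0, keepdim=True)   # averaged key
    v_comp = v[:-keep_recent].mean(dim=0, keepdim=True)   # averaged value
    return (torch.cat([k_comp, k[-keep_recent:]]),
            torch.cat([v_comp, v[-keep_recent:]]))
```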

Q5: The KV cache budget for different algorithms. Table 2 outlines the KV cache compression setup for RazorAttention. For a fair comparison, the total compression ratio for each baseline was adjusted to match RazorAttention's configuration, ensuring consistent evaluation across varying input lengths. However, as the input length increases, the absolute KV cache budget may differ proportionally due to RazorAttention's dynamic compression strategy.
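As an illustrative sketch of how the budgets are matched (the fraction of retrieval heads and the window size below are placeholders, not the configuration reported in Table 2), each baseline is given the same total cache size that RazorAttention's scheme yields at a given input length:

```python
# Illustrative budget matching; retrieval_frac and keep_recent are
# placeholders, not the configuration reported in Table 2.
def razor_cache_size(seq_len, retrieval_frac=0.15, keep_recent=4096):
    """Retrieval heads keep the full cache; other heads keep a fixed
    recent window (no compression below 4k, as in Table 2)."""
    if seq_len <= 4096:
        return seq_len
    return retrieval_frac * seq_len + (1 - retrieval_frac) * keep_recent

def matched_baseline_budget(seq_len, **kwargs):
    """Give each baseline the same average per-head cache size."""
    return round(razor_cache_size(seq_len, **kwargs))
```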

Molimolinaa commented 19 hours ago

ok

Molimolinaa commented 16 hours ago

Retrieval heads with different inputs

Molimolinaa commented 16 hours ago

Thank you for pointing this out. The retrieval patterns remain highly consistent across different input queries, suggesting that they are indeed model-based rather than query-specific. In Figure 4 of our appendix, we illustrate the retrieval heads selected using various “Needle in a Haystack” queries, demonstrating that the selected heads are unchanged across different inputs.