mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Why is the critical token dynamic in Quest while attention sink was claimed in StreamingLLM? #3

Open Zhuohao-Li opened 1 week ago

Zhuohao-Li commented 1 week ago

Hi,

I was reading your paper and have a question about the "critical tokens". In Quest, the criticality of tokens can change with different query tokens, while in StreamingLLM the initial keys and values matter a lot under your assumption (attention sink). Why is there a difference between them?

Thanks!

happierpig commented 1 week ago

Hi @Zhuohao-Li ,

Thanks for your great question. My quick answer is:

  1. Attention sink is empirically the "most critical" token, since most text generation only needs local information.
  2. Keeping only the attention sink and evicting all other tokens makes the model lose most of the context information. It still generates "normal" output, but the output is less related to the context. This is why StreamingLLM performs the worst in long-context benchmarks (Figure 7).
  3. Quest aims to accurately preserve the context by maintaining the entire KV cache. It uses the query to estimate the critical tokens, which can be tokens in the attention sink or other tokens (see the sketch below). Indeed, experiments show that the first and last tokens are the most frequently selected, along with "context tokens".

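For clarity, here is a minimal sketch (PyTorch-style Python, not the actual Quest kernels; the tensor names and `top_k` parameter are illustrative assumptions) of how query-aware criticality can be estimated per KV page from per-page min/max key metadata:

```python
import torch

def select_critical_pages(query, key_min, key_max, top_k):
    """
    Rough sketch of Quest-style query-aware page selection.

    query:   [head_dim]            current query vector for one head
    key_min: [num_pages, head_dim] elementwise min of keys in each page
    key_max: [num_pages, head_dim] elementwise max of keys in each page
    top_k:   number of KV pages to keep for the attention computation
    """
    # Upper bound of q.k for any key in the page: per channel, take the
    # larger of q * min and q * max (covers both signs of q), then sum.
    upper_bound = torch.maximum(query * key_min, query * key_max).sum(dim=-1)

    # Pages with the highest upper bounds are treated as "critical";
    # which pages win depends on the current query, so criticality is dynamic.
    return torch.topk(upper_bound, k=min(top_k, upper_bound.numel())).indices
```

Note how the selected page set changes with `query`: the attention-sink pages are frequently but not always among the winners.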
Thanks!

Zhuohao-Li commented 6 days ago

Hi @happierpig

Thanks for your answers. It makes sense to me.

I have one more question about the eval. The scale of evaluation is not very large (a 4090 for efficiency and an Ada 6000 for e2e eval); do you have any results at a larger scale with long-context serving? Have you thought about the overhead of recomputing the KV cache when a process misses it?

Thanks!

happierpig commented 6 days ago

Hi @Zhuohao-Li ,

Thanks again for your great questions! First, I believe a 4090 or Ada 6000 is enough to test kernel-level efficiency and demonstrate the feasibility of query-aware sparsity. Second, it is indeed an important discussion whether Quest can be extended to large-scale serving scenarios (assuming by "large-scale" you mean a multi-user setting).

A quick answer is yes. 1) Even with continuous batching, attention in the decode phase is still memory-bound, so query-aware sparsity intuitively still helps. 2) With GQA, all query heads within the same group should attend to the same "critical" KV tokens in order to utilize tensor core intrinsics (a sketch follows). We have some preliminary results to support this argument.
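
To make point 2 concrete, here is a hedged sketch (illustrative only; `group_size` and the score layout are assumptions, not Quest's actual batching code) of sharing one page selection across the query heads of a GQA group:

```python
import torch

def select_pages_per_group(scores, group_size, top_k):
    """
    scores: [num_query_heads, num_pages] per-head criticality scores
            (e.g., the upper bounds from the earlier sketch)
    Returns one page index set per GQA group so that all query heads in a
    group load the same KV pages, keeping a tensor-core-friendly layout.
    """
    num_heads, num_pages = scores.shape
    # Fold query heads into groups and aggregate their scores, so the
    # whole group agrees on a single set of "critical" pages.
    group_scores = (
        scores.view(num_heads // group_size, group_size, num_pages)
        .max(dim=1)
        .values
    )
    # [num_groups, top_k] page indices shared by every head in the group.
    return torch.topk(group_scores, k=min(top_k, num_pages), dim=-1).indices
```

Aggregating with a max (rather than selecting pages per query head) keeps the KV loads identical within a group, which is what lets the grouped heads be computed together with tensor cores.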