Hi @Zhuohao-Li ,
Thanks for your great question. My quick answer is:
Thanks!
Hi @happierpig
Thanks for your answers. It makes sense to me.
I have one more question about the evaluation. The scale of your evaluation is not very large (a 4090 for efficiency and an Ada 6000 for end-to-end eval); do you have any results at a larger scale with long-context serving? Have you also thought about the overhead of recomputing the KV cache when misses happen?
Thanks!
Hi @Zhuohao-Li ,
Thanks again for your great questions! First, I believe a 4090 or an Ada 6000 is enough to test kernel-level efficiency and to demonstrate the feasibility of query-aware sparsity. Second, whether Quest can be extended to large-scale serving scenarios is indeed an important discussion (assuming that by "large-scale" you mean a multi-user setting).
The quick answer is yes. 1) Even with continuous batching, attention in the decode phase is still memory-bound, so query-aware sparsity intuitively still helps. 2) With GQA, all query heads within the same group should attend to the same "critical" KV tokens in order to utilize tensor-core intrinsics. We have some preliminary results supporting this argument.
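To make both points concrete, here is a minimal sketch of query-aware page selection (the min/max key bounds follow the description in the Quest paper; the shapes, names, and the GQA group vote are my own illustration, not our actual kernels):

```python
import torch

def estimate_page_scores(q, key_min, key_max):
    # q:       [head_dim]             query vector of one head at this decode step
    # key_min: [num_pages, head_dim]  per-channel minimum of keys in each page
    # key_max: [num_pages, head_dim]  per-channel maximum of keys in each page
    # For each channel, the largest possible q_i * k_i over a page is
    # max(q_i * key_min_i, q_i * key_max_i); summing over channels gives an
    # upper bound on the attention logit q . k for any token in that page.
    upper = torch.maximum(q * key_min, q * key_max)   # [num_pages, head_dim]
    return upper.sum(dim=-1)                          # [num_pages]

def select_pages_gqa(q_group, key_min, key_max, top_k):
    # q_group: [group_size, head_dim]. All query heads in a GQA group share
    # one KV head, so they must agree on a single page set; otherwise the
    # gathered KV could not be fed to a tensor-core attention kernel as one
    # dense block.
    scores = torch.stack([estimate_page_scores(q, key_min, key_max)
                          for q in q_group])          # [group_size, num_pages]
    group_scores = scores.max(dim=0).values           # one shared vote per page
    return torch.topk(group_scores, k=min(top_k, group_scores.numel())).indices
```

Because only the selected pages are loaded for attention, the decode-phase memory traffic drops roughly in proportion to the page budget, which is why the approach still pays off under continuous batching.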
Hi,
I was reading your paper and have a question about the "critical tokens". In Quest, the criticality of tokens changes with the query token, while in StreamingLLM the initial keys and values matter a lot under the attention-sink assumption. Why is there this difference between them?
Thanks!
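To make the contrast I'm asking about concrete, here is a rough sketch of the two selection rules as I understand them (the function names and defaults are mine, not from either paper):

```python
import torch

def streamingllm_keep(seq_len, num_sinks=4, window=1024):
    # StreamingLLM: a FIXED, query-independent token set -- the first few
    # "attention sink" tokens plus a sliding window of recent tokens.
    sinks = torch.arange(min(num_sinks, seq_len))
    recent = torch.arange(max(num_sinks, seq_len - window), seq_len)
    return torch.cat([sinks, recent])

def quest_keep(q, key_min, key_max, budget):
    # Quest: a QUERY-DEPENDENT page set, re-estimated at every decode step
    # from per-page key bounds, so different queries can select different
    # parts of the KV cache.
    upper = torch.maximum(q * key_min, q * key_max).sum(-1)
    return torch.topk(upper, k=min(budget, upper.numel())).indices
```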