mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Does this work speed up the prefill phase #5

Closed dingjingzhen closed 2 weeks ago

dingjingzhen commented 1 month ago

Great job! But it seems that this work does not speed up the prefill phase, which often accounts for much of the inference time on long texts.

happierpig commented 1 month ago

Hi @dingjingzhen ,

Thanks for your interest in our work. While Quest focuses on boosting the efficiency of the decoding phase, the efficiency of the prefill phase in long-context settings is indeed an important issue, as explored by recent works like MInference. We believe the idea of query-aware sparsity can be easily extended to the prefill phase for the following reasons:

  1. Instead of saving memory movement (as in decoding), the prefill phase can be sped up by reducing computation through block sparsity: Quest's criticality estimation can decide which KV pages enter the tensor-core attention ops, and the rest are skipped entirely (see the first sketch below).
  2. Regarding the prefill phase with GQA: since the basic shape of a tensor-core op is 16x8x16 (M dimension = 16), it is better to align 2 consecutive query tokens so they attend to the same "critical" tokens; then 2 x group_size rows saturate the M dimension of 16. We can therefore use an aggregation op in the estimation step to implement this (see the second sketch below).
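
Here is a minimal PyTorch sketch of point 1 (illustrative only, not Quest's actual CUDA kernels; the function names and the per-tile max aggregation are assumptions). It uses Quest's per-page min/max key summaries to upper-bound each page's attention score and picks one shared top-k page set for a tile of prefill queries, so the unselected pages can be skipped to reduce computation rather than just memory movement:

```python
import torch

def page_upper_bound(q, k_min, k_max):
    """Quest-style criticality estimate: an upper bound on the score any key
    in a page could receive, from per-channel min/max summaries of that page.
      q:     [..., 1, d]     query vector(s)
      k_min: [num_pages, d]  per-channel key minima of each page
      k_max: [num_pages, d]  per-channel key maxima of each page
    Returns [..., num_pages] upper-bound scores."""
    # q_i * k_i is maximized at k_min_i when q_i < 0 and at k_max_i when q_i > 0.
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)

def block_sparse_prefill_pages(q_tile, k_min, k_max, top_k):
    """q_tile: [tile, d], a tile of consecutive prefill query tokens.
    The whole tile shares one set of critical pages, so the attention that
    follows stays block-sparse and tensor-core friendly."""
    ub = page_upper_bound(q_tile.unsqueeze(1), k_min, k_max)  # [tile, num_pages]
    return ub.amax(dim=0).topk(top_k).indices  # max over the tile, then top-k

# Toy usage: 64 pages of 32 keys each, a 16-token query tile, keep 8 pages.
keys = torch.randn(64, 32, 128)
k_min, k_max = keys.amin(dim=1), keys.amax(dim=1)
pages = block_sparse_prefill_pages(torch.randn(16, 128), k_min, k_max, top_k=8)
```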
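
And a sketch of the alignment trick in point 2 (again hypothetical code, not from this repo). With group_size = 8, packing 2 consecutive query tokens of one GQA group gives 16 rows, exactly the M dimension of a 16x8x16 MMA tile; aggregating the estimates with a max over those rows forces them to share one set of critical pages:

```python
import torch

def gqa_aligned_pages(q_pair, k_min, k_max, group_size, top_k):
    """q_pair:       [num_q_heads, 2, d], 2 consecutive query tokens per head.
       k_min, k_max: [num_kv_heads, num_pages, d], page summaries per KV head.
    Returns [num_kv_heads, top_k] page indices shared by each group's tile."""
    num_q_heads, _, d = q_pair.shape
    num_kv_heads = num_q_heads // group_size
    # Rows that will occupy one MMA tile: [num_kv_heads, 2 * group_size, d].
    rows = q_pair.reshape(num_kv_heads, 2 * group_size, d)
    # Upper-bound score of every row against every page of its KV head.
    ub = torch.maximum(
        rows.unsqueeze(2) * k_min.unsqueeze(1),
        rows.unsqueeze(2) * k_max.unsqueeze(1),
    ).sum(-1)  # [num_kv_heads, 2 * group_size, num_pages]
    # Max-aggregate across the 2 * group_size rows so all 16 rows (when
    # group_size = 8) attend to the same pages.
    return ub.amax(dim=1).topk(top_k, dim=-1).indices
```

The max here is just one choice of aggregation; a sum or mean over the rows would also keep the tile aligned while weighting pages differently.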