mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Does this work speed up the prefill phase #5

Closed dingjingzhen closed 2 weeks ago

dingjingzhen commented 1 month ago

Great job! But it seems that this work does not speed up the prefill phase, which often accounts for much of the inference time on long texts.

happierpig commented 1 month ago

Hi @dingjingzhen ,

Thanks for your interest in our work. While Quest focuses on boosting the efficiency of the decoding phase, the efficiency of the prefill phase in long-context settings is indeed an important issue, as explored by recent works like MInference. We believe the idea of query-aware sparsity can be easily extended to the prefill phase for the following reasons:

  1. Instead of saving memory movement (as in decoding), the prefill phase can be sped up by reducing computation through block sparsity: Quest's criticality estimation can decide which KV pages enter the tensor-core attention ops, and the rest are skipped entirely (see the first sketch below).
  2. Regarding the prefill phase with GQA: since the basic shape of a tensor-core op is 16x8x16 (M dimension = 16), it is better to align 2 consecutive query tokens so they attend to the same "critical" tokens; then 2 x group_size rows saturate the M dimension of 16. We can therefore use an aggregation op in the estimation step to implement this (see the second sketch below).
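
Here is a minimal PyTorch sketch of point 1 (illustrative only, not Quest's actual CUDA kernels; the function names and the per-tile max aggregation are assumptions). It uses Quest's per-page min/max key summaries to upper-bound each page's attention score and picks one shared top-k page set for a tile of prefill queries, so the unselected pages can be skipped to reduce computation rather than just memory movement:

```python
import torch

def page_upper_bound(q, k_min, k_max):
    """Quest-style criticality estimate: an upper bound on the score any key
    in a page could receive, from per-channel min/max summaries of that page.
      q:     [..., 1, d]     query vector(s)
      k_min: [num_pages, d]  per-channel key minima of each page
      k_max: [num_pages, d]  per-channel key maxima of each page
    Returns [..., num_pages] upper-bound scores."""
    # q_i * k_i is maximized at k_min_i when q_i < 0 and at k_max_i when q_i > 0.
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)

def block_sparse_prefill_pages(q_tile, k_min, k_max, top_k):
    """q_tile: [tile, d], a tile of consecutive prefill query tokens.
    The whole tile shares one set of critical pages, so the attention that
    follows stays block-sparse and tensor-core friendly."""
    ub = page_upper_bound(q_tile.unsqueeze(1), k_min, k_max)  # [tile, num_pages]
    return ub.amax(dim=0).topk(top_k).indices  # max over the tile, then top-k

# Toy usage: 64 pages of 32 keys each, a 16-token query tile, keep 8 pages.
keys = torch.randn(64, 32, 128)
k_min, k_max = keys.amin(dim=1), keys.amax(dim=1)
pages = block_sparse_prefill_pages(torch.randn(16, 128), k_min, k_max, top_k=8)
```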
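
And a sketch of the alignment trick in point 2 (again hypothetical code, not from this repo). With group_size = 8, packing 2 consecutive query tokens of one GQA group gives 16 rows, exactly the M dimension of a 16x8x16 MMA tile; aggregating the estimates with a max over those rows forces them to share one set of critical pages:

```python
import torch

def gqa_aligned_pages(q_pair, k_min, k_max, group_size, top_k):
    """q_pair:       [num_q_heads, 2, d], 2 consecutive query tokens per head.
       k_min, k_max: [num_kv_heads, num_pages, d], page summaries per KV head.
    Returns [num_kv_heads, top_k] page indices shared by each group's tile."""
    num_q_heads, _, d = q_pair.shape
    num_kv_heads = num_q_heads // group_size
    # Rows that will occupy one MMA tile: [num_kv_heads, 2 * group_size, d].
    rows = q_pair.reshape(num_kv_heads, 2 * group_size, d)
    # Upper-bound score of every row against every page of its KV head.
    ub = torch.maximum(
        rows.unsqueeze(2) * k_min.unsqueeze(1),
        rows.unsqueeze(2) * k_max.unsqueeze(1),
    ).sum(-1)  # [num_kv_heads, 2 * group_size, num_pages]
    # Max-aggregate across the 2 * group_size rows so all 16 rows (when
    # group_size = 8) attend to the same pages.
    return ub.amax(dim=1).topk(top_k, dim=-1).indices
```

The max here is just one choice of aggregation; a sum or mean over the rows would also keep the tile aligned while weighting pages differently.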