mit-han-lab / Quest

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Support for bsz>1 and GQA #8

Open Ryanuppp opened 1 month ago

Ryanuppp commented 1 month ago

Great job! We found that Quest is implemented on top of an older version of flashinfer, so some common features are not currently supported.

happierpig commented 4 weeks ago

Hi @Ryanuppp ,

We are exploring follow-up work to Quest, in which we plan to add these features.

wjj19950828 commented 3 days ago

@happierpig Is GQA currently supported? Mainstream architectures such as Llama 2/3 70B mainly use GQA, so supporting this feature is very important.
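For context on what GQA support entails: in grouped-query attention, several query heads share a single KV head (e.g. Llama-2-70B has 64 query heads but only 8 KV heads), so any sparsity decision made per KV page must apply to the whole group of query heads that reads it. The sketch below is purely illustrative and is not Quest's or flashinfer's actual code; it shows the standard trick of repeating each KV head across its query group before computing attention:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Illustrative GQA sketch (not Quest's implementation).

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d),
    where n_q_heads is a multiple of n_kv_heads.
    """
    n_rep = q.shape[0] // k.shape[0]          # query heads per KV head
    k = np.repeat(k, n_rep, axis=0)           # expand each KV head to its query group
    v = np.repeat(v, n_rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v                              # (n_q_heads, seq, d)

# Toy shapes: 8 query heads sharing 2 KV heads
q = np.random.randn(8, 4, 16)
k = np.random.randn(2, 4, 16)
v = np.random.randn(2, 4, 16)
out = gqa_attention(q, k, v)   # shape (8, 4, 16)
```

A query-aware sparsity scheme would additionally need to aggregate the page-selection criticality scores across each group of query heads, since all heads in a group must fetch the same KV pages.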