[NeurIPS'24 Spotlight] To speed up long-context LLM inference, attention is computed approximately with dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
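For context, the dynamic sparsity referred to above keeps only a few vertical key columns and diagonal ("slash") lines of the attention matrix per head. The snippet below is a minimal PyTorch sketch of that masking idea, not the repository's implementation; the line counts, offsets, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vertical_slash_mask(seq_len, vertical_idx, slash_offsets):
    """Boolean [seq_len, seq_len] mask that keeps a few key columns and diagonals."""
    q_pos = torch.arange(seq_len).unsqueeze(1)   # query positions, shape [S, 1]
    k_pos = torch.arange(seq_len).unsqueeze(0)   # key positions,   shape [1, S]
    causal = k_pos <= q_pos                      # standard causal constraint
    vertical = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    vertical[:, vertical_idx] = True             # selected "vertical" key columns
    slash = ((q_pos - k_pos).unsqueeze(-1) == slash_offsets).any(-1)  # selected diagonals
    return causal & (vertical | slash)

def sparse_attention(q, k, v, mask):
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, dim = 64, 32
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
mask = vertical_slash_mask(
    seq_len,
    vertical_idx=torch.tensor([0, 1, 2, 3]),    # assumed vertical line positions
    slash_offsets=torch.tensor([0, 1, 2, 16]),  # assumed diagonal offsets (0 = main diagonal)
)
out = sparse_attention(q, k, v, mask)
print(out.shape, mask.float().mean().item())    # output shape and fraction of entries kept
```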
Is your feature request related to a problem? Please describe.
I use LLaVA from its official repo and run the pattern search with an input sample. However, GPU utilization and generation speed are low (GPU utilization around 17%). Is this related to the short sequence length? Also, can the pattern search be run with a smaller search space, i.e. fewer vertical and diagonal lines? A sketch of what that could mean is shown below.
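To make the second question concrete, here is a hypothetical sketch of a reduced search space: the search would only try a short list of (num_vertical, num_slash) line counts per head. The names, values, and recall target are assumptions for illustration, not the project's actual configuration.

```python
# Hypothetical reduced search space of (num_vertical, num_slash) line counts per head.
SMALL_SEARCH_SPACE = [
    (30, 100),
    (100, 400),
    (500, 800),
]

def pick_pattern(recall_fn, candidates, target=0.95):
    """Return the cheapest candidate whose recovered attention mass reaches the target."""
    for num_vertical, num_slash in sorted(candidates, key=lambda c: c[0] + c[1]):
        if recall_fn(num_vertical, num_slash) >= target:
            return num_vertical, num_slash
    return max(candidates, key=lambda c: c[0] + c[1])  # fall back to the densest option

# Toy recall function standing in for measuring how much attention mass the mask recovers.
print(pick_pattern(lambda v, s: (v + s) / 1300, SMALL_SEARCH_SPACE))
```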
Hi @ThisisBillhe, thank you for your suggestion and support. This is already part of our ongoing research plan, and we're striving to release the related content as soon as possible.