microsoft / MInference

[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, MInference computes attention with approximate, dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Feature Request]: Support LLaVA Model / Low generation speed #74

Open · ThisisBillhe opened this issue 1 month ago

ThisisBillhe commented 1 month ago

Is your feature request related to a problem? Please describe.

I use LLaVA from its official repo and search for the attention pattern with an input sample. However, GPU utilization and generation speed are low (GPU utilization is around 17%). Could this be related to the short sequence length? Also, could we search for patterns over a search space with fewer vertical and diagonal lines (see the sketch below for what these lines look like)?
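To make the "fewer vertical and diagonal lines" idea concrete, here is a minimal, illustrative sketch of how a vertical-plus-slash (diagonal) sparse mask could be built from estimated attention scores. This is not MInference's actual kernel or pattern-search code; the function name, the two top-k budgets, and the random score estimate are assumptions for illustration only.

```python
# Illustrative sketch (not MInference's implementation) of a "vertical + slash"
# sparse-attention mask: keep only the strongest vertical columns and diagonal
# (slash) lines of an estimated attention map, drop everything else.
import torch

def vertical_slash_mask(attn_est: torch.Tensor, n_vertical: int, n_slash: int) -> torch.Tensor:
    """attn_est: (seq_len, seq_len) estimated attention weights for one head.
    Returns a causal boolean mask covering the top vertical and slash lines."""
    seq_len = attn_est.size(0)

    # Score each vertical line (a key attended to by many queries) by summing
    # the estimated attention it receives over all queries.
    vertical_scores = attn_est.sum(dim=0)                      # (seq_len,)
    top_vertical = vertical_scores.topk(min(n_vertical, seq_len)).indices

    # Score each slash line (a fixed query-key offset, i.e. a diagonal) by
    # summing the estimated attention along that diagonal.
    slash_scores = torch.stack(
        [attn_est.diagonal(offset=-d).sum() for d in range(seq_len)]
    )                                                          # (seq_len,)
    top_slash = slash_scores.topk(min(n_slash, seq_len)).indices

    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, top_vertical] = True                               # keep vertical lines
    rows = torch.arange(seq_len)
    for d in top_slash.tolist():                               # keep slash lines
        idx = rows[d:]
        mask[idx, idx - d] = True
    return torch.tril(mask.long()).bool()                      # enforce causality

# With a short sequence, a handful of kept lines already covers most of the
# lower triangle, so the sparse pattern saves little work -- one plausible
# reason a short LLaVA prompt shows low GPU utilization and little speedup.
est = torch.rand(64, 64).tril()
print(vertical_slash_mask(est, n_vertical=8, n_slash=8).float().mean())
```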

Describe the solution you'd like

No response

Additional context

No response

iofu728 commented 1 month ago

Hi @ThisisBillhe, thank you for your suggestion and support. This is already part of our ongoing research plan, and we're striving to release the related content as soon as possible.