microsoft / MInference

[NeurIPS'24 Spotlight] To speed up long-context LLM inference, MInference uses approximate and dynamic sparse attention, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: The speed of minference in short context #33

Open maxin9966 opened 3 months ago

maxin9966 commented 3 months ago

Describe the issue

For short sequences, does MInference reduce inference speed or affect the throughput of vLLM?

iofu728 commented 3 months ago

Hi @maxin9966, thank you for your question. Due to the overhead of building the approximate sparse index, latency can be slightly higher than full attention for contexts below 10k tokens. You can find detailed latency benchmark results at minference-benchmark-experiments. You can decide whether to enable MInference based on the context size.
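
A minimal sketch of such length-based gating, assuming the `MInference(...)` patch usage shown in the repo README; the model name, the 10k cross-over threshold, and the constructor arguments are assumptions and may differ across versions:

```python
# Sketch: apply the MInference patch only when the expected context is long
# enough for sparse pre-fill to outweigh the index-building overhead.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

MODEL_NAME = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model (assumption)
CONTEXT_THRESHOLD = 10_000  # assumption: rough cross-over point mentioned above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype="auto", device_map="auto"
)

expected_context_tokens = 50_000  # e.g. measured from your typical prompts

if expected_context_tokens >= CONTEXT_THRESHOLD:
    # Replace the attention forward pass with MInference's dynamic sparse kernels.
    minference_patch = MInference("minference", MODEL_NAME)
    model = minference_patch(model)
# Below the threshold, keep the unpatched model: full attention is faster there.
```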