Open maxin9966 opened 3 months ago
Describe the issue

For short sequences, does MInference reduce inference speed or affect vLLM's throughput?

Hi @maxin9966, thank you for your question. Because of the overhead of building the approximate and sparse indices, latency can be slightly higher than with full attention when the context size is below 10k tokens. You can find detailed latency benchmark results at minference-benchmark-experiments. You can control whether to use MInference based on the context size.
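Since MInference is applied by patching the vLLM engine, one practical way to act on this advice is to decide at engine-construction time whether to apply the patch. Below is a minimal sketch assuming the `MInference("vllm", model_name)` patching call shown in the MInference README; the model name, the `long_context` toggle, and the prompt are illustrative, not part of this thread.

```python
from vllm import LLM, SamplingParams
from minference import MInference

def build_engine(model_name: str, long_context: bool) -> LLM:
    """Build a vLLM engine, applying the MInference patch only for
    long-context workloads (roughly >10k tokens, per the note above)."""
    llm = LLM(model_name, max_num_seqs=1, enforce_eager=True)
    if long_context:
        # Patch the engine with MInference's sparse-attention kernels.
        minference_patch = MInference("vllm", model_name)
        llm = minference_patch(llm)
    return llm

# Example: these prompts are well under 10k tokens, so skip the patch and
# keep full attention; flip long_context=True for long-context workloads.
engine = build_engine("gradientai/Llama-3-8B-Instruct-262k", long_context=False)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = engine.generate(["Summarize the following text: ..."], params)
```

Note that because the patch modifies the engine itself, the toggle is coarse-grained (per engine, not per request); routing individual requests by length would require keeping a patched and an unpatched engine side by side.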