Closed Arcmoon-Hu closed 2 weeks ago
Hi @Arcmoon-Hu, thanks for your support.
MInference 1.0 focuses on speeding up the pre-filling stage of long-context LLM inference, reducing the time from 30 minutes to 3 minutes for 1M tokens on an A100. This work does not address the KV cache storage issue. Future work on MInference will include solutions to reduce KV cache memory overhead.
Additionally, there are several research efforts focused on KV cache compression (like H2O, SnapKV) and KV cache quantization (KIVI). You might consider using these solutions.
Ok, thank you for your response.
Describe the issue
In fact, there are currently numerous works that extend the context window, but as the context grows, the KV cache grows with it, leading to a sharp rise in HBM usage. Therefore, I'm wondering whether this work can save a significant amount of HBM.
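For context, the HBM pressure described above can be estimated with simple arithmetic. Below is a rough sketch; the 7B-style configuration (32 layers, 32 KV heads, head dim 128, fp16) is a hypothetical example and is not taken from MInference itself:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold seq_len * n_kv_heads * head_dim elements per layer,
    # hence the leading factor of 2; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical Llama-2-7B-like config with full multi-head attention
size = kv_cache_bytes(seq_len=1_000_000, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.0f} GiB")  # ~488 GiB for 1M tokens
```

At 1M tokens this dwarfs a single A100's 80 GB of HBM, which is why the reply above points to KV cache compression and quantization as separate lines of work.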