microsoft / MInference

To speed up long-context LLM inference, MInference applies approximate and dynamic sparse attention, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: Please clarify how HBM usage grows with context length. #7

Closed Arcmoon-Hu closed 2 weeks ago

Arcmoon-Hu commented 2 weeks ago

Describe the issue

There are currently numerous works that extend context length, but as the context grows, the KV cache grows with it, leading to a sharp rise in HBM usage. Can this work save a significant amount of HBM?
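
For a sense of scale, here is a rough, illustrative estimate of how KV cache size grows with context length (the layer count, KV-head count, and head dimension below are assumed, LLaMA-3-8B-like values, not tied to any specific model):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
# The model dimensions here are illustrative assumptions, not a specific checkpoint.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens -> ~{gib:.1f} GiB of KV cache per sequence")
```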

iofu728 commented 2 weeks ago

Hi @Arcmoon-Hu, thanks for your support.

MInference 1.0 focuses on speeding up the pre-filling stage of long-context LLM inference, reducing the time from 30 minutes to 3 minutes for 1M tokens on an A100. This work does not address the KV cache storage issue. Future work on MInference will include solutions to reduce KV cache memory overhead.
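
For reference, a minimal sketch of applying MInference to a Hugging Face model for faster pre-filling, following the usage pattern in the repository README; the model name is only illustrative and the exact arguments may differ across versions:

```python
# Sketch: patch a Hugging Face model with MInference's dynamic sparse attention
# for the pre-filling stage. Call pattern follows the repo README (assumed here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative long-context model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

minference_patch = MInference("minference", model_name)
model = minference_patch(model)  # swaps in the sparse pre-filling attention kernels
```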

Additionally, there are several lines of research focused on KV cache compression (e.g., H2O, SnapKV) and KV cache quantization (e.g., KIVI). You might consider using these solutions.
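
To illustrate the core idea behind eviction-based compression (this is a toy sketch, not the actual H2O or SnapKV implementation): keep a recent window plus the tokens that have accumulated the most attention, and drop the rest of the KV cache.

```python
import torch

def heavy_hitter_evict(keys, values, attn_scores, budget, recent=32):
    """Toy heavy-hitter-style eviction: keep the `recent` newest tokens plus the
    tokens with the largest accumulated attention mass, up to `budget` entries.
    keys/values: (seq_len, head_dim); attn_scores: (num_queries, seq_len)."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    scores = attn_scores.sum(dim=0)     # accumulated attention per cached token
    scores[-recent:] = float("inf")     # always retain the recent window
    keep = torch.topk(scores, budget).indices.sort().values
    return keys[keep], values[keep]
```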

Arcmoon-Hu commented 2 weeks ago

OK, thank you for your response.