microsoft / MInference

To speed up long-context LLM inference, MInference applies approximate and dynamic sparse attention, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: Please clarify how HBM usage grows with context length. #7

Closed Arcmoon-Hu closed 2 weeks ago

Arcmoon-Hu commented 2 weeks ago

Describe the issue

There are currently numerous works that extend context length, but as the context grows, the KV cache grows with it, leading to a sharp rise in HBM usage. Can this work save a significant amount of HBM?
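
For a sense of scale, here is a rough, illustrative estimate of how KV cache size grows with context length (the layer count, KV-head count, and head dimension below are assumed, LLaMA-3-8B-like values, not tied to any specific model):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
# The model dimensions here are illustrative assumptions, not a specific checkpoint.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens -> ~{gib:.1f} GiB of KV cache per sequence")
```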

iofu728 commented 2 weeks ago

Hi @Arcmoon-Hu, thanks for your support.

MInference 1.0 focuses on speeding up the pre-filling stage of long-context LLM inference, reducing the time from 30 minutes to 3 minutes for 1M tokens on an A100. This work does not address the KV cache storage issue. Future work on MInference will include solutions to reduce KV cache memory overhead.
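
For reference, a minimal sketch of applying MInference to a Hugging Face model for faster pre-filling, following the usage pattern in the repository README; the model name is only illustrative and the exact arguments may differ across versions:

```python
# Sketch: patch a Hugging Face model with MInference's dynamic sparse attention
# for the pre-filling stage. Call pattern follows the repo README (assumed here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative long-context model
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

minference_patch = MInference("minference", model_name)
model = minference_patch(model)  # swaps in the sparse pre-filling attention kernels
```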

Additionally, there are several lines of research focused on KV cache compression (e.g., H2O, SnapKV) and KV cache quantization (e.g., KIVI). You might consider using these solutions.
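
To illustrate the core idea behind eviction-based compression (this is a toy sketch, not the actual H2O or SnapKV implementation): keep a recent window plus the tokens that have accumulated the most attention, and drop the rest of the KV cache.

```python
import torch

def heavy_hitter_evict(keys, values, attn_scores, budget, recent=32):
    """Toy heavy-hitter-style eviction: keep the `recent` newest tokens plus the
    tokens with the largest accumulated attention mass, up to `budget` entries.
    keys/values: (seq_len, head_dim); attn_scores: (num_queries, seq_len)."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    scores = attn_scores.sum(dim=0)     # accumulated attention per cached token
    scores[-recent:] = float("inf")     # always retain the recent window
    keep = torch.topk(scores, budget).indices.sort().values
    return keys[keep], values[keep]
```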

Arcmoon-Hu commented 2 weeks ago

OK, thank you for your response.