vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models #3532

Open chizhang118 opened 6 months ago

chizhang118 commented 6 months ago

🚀 The feature, motivation and pitch

This paper might be of interest: https://arxiv.org/pdf/2306.14048.pdf

The paper shows that evicting a large portion of the KV cache barely affects output quality while substantially improving memory efficiency: keeping only 20% of the cache as Heavy Hitters (H2), H2O increases throughput by up to 29x and reduces latency by up to 1.9x. The key observation is that, when computing attention, a small subset of tokens contributes most of the attention score mass. The paper proposes the Heavy Hitter Oracle (H2O), a KV cache eviction strategy that dynamically balances retention between recent tokens and H2 tokens, and frames KV cache eviction as a dynamic submodular problem.
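For concreteness, here is a minimal sketch of what the greedy H2O policy could look like. The accumulated per-token attention scores and all names here are assumptions for illustration, not vLLM internals:

```python
import torch

def h2o_keep_indices(acc_scores: torch.Tensor,
                     num_heavy: int,
                     num_recent: int) -> torch.Tensor:
    """Pick which KV cache entries to keep under an H2O-style policy.

    acc_scores: [seq_len] attention scores accumulated per cached token
                (an assumed input; obtaining it is the hard part).
    num_heavy:  budget for heavy-hitter (H2) tokens.
    num_recent: budget for the local window of most recent tokens.
    """
    seq_len = acc_scores.shape[0]
    assert 0 < num_recent < seq_len
    keep = torch.zeros(seq_len, dtype=torch.bool)
    # Always keep the most recent tokens (the local window).
    keep[-num_recent:] = True
    # Among the older tokens, keep the ones with the largest accumulated
    # attention scores, i.e. the heavy hitters.
    candidates = acc_scores.clone()
    candidates[-num_recent:] = float("-inf")  # exclude the local window
    k = min(num_heavy, seq_len - num_recent)
    heavy = torch.topk(candidates, k=k).indices
    keep[heavy] = True
    return keep.nonzero(as_tuple=True)[0]
```

Everything outside the returned indices would be evicted after each decoding step.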

Trade-off: evicting KV cache entries deemed unimportant may raise accuracy concerns, but it reduces memory usage and improves throughput.

@simon-mo Is this a feature you'd like to see implemented?

Alternatives

No response

Additional context

No response

ChuanhongLi commented 6 months ago

H2O needs attention scores to decide which entries should be evicted. Whether in prompt processing (xops.memory_efficient_attention_forward, vLLM 0.2.7) or in the decoding phase, it is not easy to get the attention scores. Any ideas?
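One (expensive) workaround could be to recompute the scores outside the fused kernel, since kernels like xops.memory_efficient_attention_forward do not return the attention matrix. A rough sketch for a single decode step, with assumed tensor shapes:

```python
import torch

def decode_step_scores(q: torch.Tensor, k_cache: torch.Tensor) -> torch.Tensor:
    """Recompute per-token attention scores for one decoding step.

    q:       [num_heads, head_dim] query of the current token (assumed shape).
    k_cache: [seq_len, num_heads, head_dim] cached keys (assumed shape).
    Returns a [seq_len] score per cached token, summed over heads.
    """
    scale = q.shape[-1] ** -0.5
    # [num_heads, seq_len] attention logits, then softmax over the sequence.
    logits = torch.einsum("hd,shd->hs", q, k_cache) * scale
    probs = torch.softmax(logits, dim=-1)
    # Collapse heads into a single per-token statistic to accumulate.
    return probs.sum(dim=0)
```

This duplicates part of the attention computation, so a fused kernel that also emits (or accumulates) the scores would be preferable.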

laneeeee commented 6 months ago

Very interesting idea! The paper uses WikiText-103 as the test dataset to reach its conclusions; I think the conclusion might be different on a dataset of pure mathematical formulas.

beagleski commented 6 months ago

There's a similar feature request for StreamingLLM, issue here. Meanwhile, FastGen and Scissorhands are also closely related KV cache compression methods. It would be better if the implementation design were more general, e.g., flexible head-wise/layer-wise KV cache management, along the lines of the sketch below. @simon-mo @chizhang118 @WoosukKwon
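For example, a pluggable per-layer/per-head policy interface might look like the following. This is only a design sketch; none of these names exist in vLLM:

```python
from abc import ABC, abstractmethod

import torch

class KVCachePolicy(ABC):
    """Hypothetical interface that H2O-, StreamingLLM-, Scissorhands-, or
    FastGen-style policies could implement behind one integration point."""

    @abstractmethod
    def observe(self, layer: int, attn_probs: torch.Tensor) -> None:
        """Accumulate per-token statistics from one layer's attention
        probabilities, shape [num_heads, seq_len] (assumed)."""

    @abstractmethod
    def keep_mask(self, layer: int, head: int,
                  seq_len: int, budget: int) -> torch.Tensor:
        """Return a [seq_len] boolean mask of KV entries to retain for one
        head of one layer, so decisions can differ per head and per layer."""
```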

PiotrNawrot commented 6 months ago

Another promising KV Cache Compression method (this time learned).