vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Hybrid Attention #6323

Open leo6022 opened 2 months ago

leo6022 commented 2 months ago

🚀 The feature, motivation and pitch

Some models (Gemma2, ...) use hybrid attention: global attention in some layers and local (sliding-window) attention in others. But vLLM currently ignores the local attention and treats every layer as global attention.

By simply setting the window size, we can enable local attention in vLLM via flash-attn. This accelerates the prefill phase in long-context cases. In the decode phase, however, the local-attention layers still cache the KV of every token and only restrict the computation to the window, so KV-cache consumption is not reduced.
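For context, here is a minimal sketch of what "setting the window size" can look like with the flash-attn Python API (assuming flash-attn >= 2.3, which added sliding-window support; the wrapper functions and the 4096-token window are my own illustration, not vLLM internals):

```python
# Minimal sketch, not vLLM code: local vs. global attention via flash-attn's
# sliding-window support. Helper names and the window value are assumptions.
from flash_attn import flash_attn_func

def local_attention(q, k, v, window: int = 4096):
    # q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16 on GPU.
    # window_size=(window - 1, 0) lets each query attend only to itself and
    # the previous `window - 1` tokens (a causal sliding window).
    return flash_attn_func(q, k, v, causal=True, window_size=(window - 1, 0))

def global_attention(q, k, v):
    # window_size defaults to (-1, -1), i.e. full causal attention.
    return flash_attn_func(q, k, v, causal=True)
```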

So is there any plan to optimize this? Or could you give some advice on how to develop it?

Alternatives

No response

Additional context

No response

simon-mo commented 2 months ago

The main blocker is managing the KV cache properly with this scheme. Currently, our paged KV cache system (BlockTableV2) doesn't support hybrid attention.
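As a rough illustration of the mismatch (my own arithmetic, not the block manager's logic): the number of KV-cache blocks one sequence needs differs sharply between a global layer and a sliding-window layer, yet today every layer is allocated as if it were global.

```python
# Illustrative arithmetic only, not vLLM code: KV-cache blocks one sequence
# needs per layer, assuming `block_size` tokens per block and a sliding
# window of `window` tokens for local layers (block alignment ignored, which
# may cost one extra block in practice).
import math
from typing import Optional

def blocks_needed(seq_len: int, block_size: int = 16,
                  window: Optional[int] = None) -> int:
    # A global-attention layer must keep the KV of every token;
    # a sliding-window layer only needs the most recent `window` tokens.
    kept = seq_len if window is None else min(seq_len, window)
    return math.ceil(kept / block_size)

seq_len = 32_768
print(blocks_needed(seq_len))               # global layer: 2048 blocks
print(blocks_needed(seq_len, window=4096))  # local layer:   256 blocks
```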

leo6022 commented 2 months ago

> The main blocker is managing the KV cache properly with this scheme. Currently, our paged KV cache system (BlockTableV2) doesn't support hybrid attention.

Is there a plan to support it? Or give some advice on how to develop it in vllm.

simon-mo commented 2 months ago

The pointers are:

leo6022 commented 2 months ago

@simon-mo One problem is how to allocate num_blocks between the global-attention layers and the local-attention layers. I think there is an imbalance problem in the cache; a rough sketch of one way to frame it is below.
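One possible framing, as a hedged sketch (the proportional split, layer counts, window size, and block size are assumptions for illustration, not a vLLM design or Gemma2's configuration): give each kind of layer a share of the GPU block budget proportional to the tokens it actually has to cache.

```python
# Hypothetical sketch of a static split of the GPU block budget between
# global-attention and local-attention layers. All numbers are illustrative.
def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def split_block_budget(total_blocks: int, num_global: int, num_local: int,
                       max_seq_len: int, window: int,
                       block_size: int = 16) -> tuple[int, int]:
    global_need = num_global * _ceil_div(max_seq_len, block_size)
    local_need = num_local * _ceil_div(min(window, max_seq_len), block_size)
    # Split proportionally to need; a real allocator would also have to
    # rebalance at runtime as sequences grow past the window length.
    global_share = total_blocks * global_need // (global_need + local_need)
    return global_share, total_blocks - global_share

print(split_block_budget(total_blocks=10_000, num_global=13, num_local=13,
                         max_seq_len=8192, window=4096))
# -> (6666, 3334): the global layers need roughly twice the blocks here.
```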