vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Hybrid Attention #6323

Open leo6022 opened 2 months ago

leo6022 commented 2 months ago

🚀 The feature, motivation and pitch

Some models (Gemma2, ...) use hybrid attention: global attention in some layers and local (sliding-window) attention in others. But vLLM currently ignores the local attention and treats every layer as global attention.

By simply setting the window size, we can enable local attention in vLLM via flash-attn. This accelerates the prefill phase in long-context cases. In the decode phase, however, the local-attention layers still cache the KV of every token and only restrict the computation to the window, so KV-cache consumption is not reduced.
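For context, here is a minimal sketch of what "setting the window size" can look like with the flash-attn Python API (assuming flash-attn >= 2.3, which added sliding-window support; the wrapper functions and the 4096-token window are my own illustration, not vLLM internals):

```python
# Minimal sketch, not vLLM code: local vs. global attention via flash-attn's
# sliding-window support. Helper names and the window value are assumptions.
from flash_attn import flash_attn_func

def local_attention(q, k, v, window: int = 4096):
    # q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16 on GPU.
    # window_size=(window - 1, 0) lets each query attend only to itself and
    # the previous `window - 1` tokens (a causal sliding window).
    return flash_attn_func(q, k, v, causal=True, window_size=(window - 1, 0))

def global_attention(q, k, v):
    # window_size defaults to (-1, -1), i.e. full causal attention.
    return flash_attn_func(q, k, v, causal=True)
```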

So is there any plan to optimize this? Or could you give some advice on how to develop it?

Alternatives

No response

Additional context

No response

simon-mo commented 2 months ago

The main blocker is managing the KV cache properly with this scheme. Currently, our paged KV cache system (BlockTableV2) doesn't support hybrid attention.
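As a rough illustration of the mismatch (my own arithmetic, not the block manager's logic): the number of KV-cache blocks one sequence needs differs sharply between a global layer and a sliding-window layer, yet today every layer is allocated as if it were global.

```python
# Illustrative arithmetic only, not vLLM code: KV-cache blocks one sequence
# needs per layer, assuming `block_size` tokens per block and a sliding
# window of `window` tokens for local layers (block alignment ignored, which
# may cost one extra block in practice).
import math
from typing import Optional

def blocks_needed(seq_len: int, block_size: int = 16,
                  window: Optional[int] = None) -> int:
    # A global-attention layer must keep the KV of every token;
    # a sliding-window layer only needs the most recent `window` tokens.
    kept = seq_len if window is None else min(seq_len, window)
    return math.ceil(kept / block_size)

seq_len = 32_768
print(blocks_needed(seq_len))               # global layer: 2048 blocks
print(blocks_needed(seq_len, window=4096))  # local layer:   256 blocks
```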

leo6022 commented 2 months ago

> The main blocker is managing the KV cache properly with this scheme. Currently, our paged KV cache system (BlockTableV2) doesn't support hybrid attention.

Is there a plan to support it? Or give some advice on how to develop it in vllm.

simon-mo commented 2 months ago

The pointers are:

leo6022 commented 2 months ago

@simon-mo One problem is how to allocate num_blocks between the global-attention layers and the local-attention layers. I think there is an imbalance problem in the cache; a rough sketch of one way to frame it is below.
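One possible framing, as a hedged sketch (the proportional split, layer counts, window size, and block size are assumptions for illustration, not a vLLM design or Gemma2's configuration): give each kind of layer a share of the GPU block budget proportional to the tokens it actually has to cache.

```python
# Hypothetical sketch of a static split of the GPU block budget between
# global-attention and local-attention layers. All numbers are illustrative.
def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def split_block_budget(total_blocks: int, num_global: int, num_local: int,
                       max_seq_len: int, window: int,
                       block_size: int = 16) -> tuple[int, int]:
    global_need = num_global * _ceil_div(max_seq_len, block_size)
    local_need = num_local * _ceil_div(min(window, max_seq_len), block_size)
    # Split proportionally to need; a real allocator would also have to
    # rebalance at runtime as sequences grow past the window length.
    global_share = total_blocks * global_need // (global_need + local_need)
    return global_share, total_blocks - global_share

print(split_block_budget(total_blocks=10_000, num_global=13, num_local=13,
                         max_seq_len=8192, window=4096))
# -> (6666, 3334): the global layers need roughly twice the blocks here.
```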