vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Add Sliding Window support to FlashInfer backend? #6899

Open noamgat opened 1 month ago

noamgat commented 1 month ago

🚀 The feature, motivation and pitch

FlashInfer v0.1.2 was just released with sliding window support:

https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.2

This should allow vLLM to add sliding window support and reach the full 8k context length with Gemma 2.

However, in vLLM's FlashAttention backend I see the following block:

if sliding_window is not None:
    # NOTE(woosuk): flash-attn's sliding window does not work with
    # paged KV cache.
    raise ValueError(
        "Sliding window is not supported in FlashAttention.")

Even though FlashAttention itself supports sliding window attention, vLLM's FlashAttention wrapper does not expose it. What is the conflict between sliding window and the paged KV cache? Does that limitation mean that supporting it through FlashInfer is also impossible?
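
For anyone skimming, here is a minimal sketch of what the sliding-window constraint means at the masking level; the names and layout are purely illustrative and not taken from vLLM's or FlashInfer's kernels. Each query position attends only to the previous `window_size` key positions; the question above is why this interacts badly with a paged KV cache.

    import torch

    def sliding_window_mask(num_queries: int, num_keys: int,
                            window_size: int) -> torch.Tensor:
        # Query i sits at absolute position (num_keys - num_queries + i).
        # It may attend to key j iff j is not in the future and the
        # distance back to j is strictly less than window_size.
        q_pos = torch.arange(num_keys - num_queries, num_keys).unsqueeze(1)
        k_pos = torch.arange(num_keys).unsqueeze(0)
        return (k_pos <= q_pos) & ((q_pos - k_pos) < window_size)

    def naive_sliding_window_attention(q, k, v, window_size):
        # q: [num_q, d]; k, v: [num_k, d]. Single head, no paging -- illustration only.
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(0, 1)) * scale
        mask = sliding_window_mask(q.shape[0], k.shape[0], window_size)
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v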

Alternatives

FlashAttention was recently updated with logit soft-capping, so if vLLM's wrapper were updated to use it, and sliding window support were enabled as well, Gemma 2's 8k context would also be achievable through that route.
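
For context, logit soft-capping here means squashing the attention scores with a tanh before the softmax. A minimal sketch of the operation; the cap value of 50.0 is the attention soft-cap from Gemma 2's published config as far as I can tell, everything else is illustrative:

    import torch

    def soft_cap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
        # tanh soft-capping keeps attention logits inside (-cap, cap) while
        # staying smooth; 50.0 matches Gemma 2's attention soft-cap (assumed here).
        return cap * torch.tanh(scores / cap)

    # Applied just before the softmax, e.g.:
    #   probs = torch.softmax(soft_cap(q @ k.transpose(0, 1) * scale), dim=-1)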

Additional context

No response

davidgxue commented 1 month ago

+1, this is preventing my Mistral and Phi-3 models from using FA2 :(( I think this check in the wrapper was added a few months ago too.

tatiana-iazykova commented 2 weeks ago

@WoosukKwon I back this too. Most of my work is based on the Mistral architecture, and it would be nice to be able to make use of vLLM as well.