Open · noamgat opened this issue 3 months ago
+1, this is preventing my Mistral and Phi-3 models from using FA2 :((. I think this check was added to the wrapper only a few months ago, too.
@WoosukKwon I back this too. Most of my work is based on the Mistral architecture, and it would be nice to be able to make use of vLLM as well.
+1, any update on this?
🚀 The feature, motivation and pitch
FlashInfer v0.1.2 was just released with sliding window support:
https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.2
This should allow vLLM to add sliding window support to its FlashInfer backend and reach the full 8k context length with Gemma 2.
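For reference, here is a minimal sketch of what the new kernel-level API looks like, based on FlashInfer's documented `single_decode_with_kv_cache` keyword arguments; the exact signature in v0.1.2 may differ:

```python
# Minimal sketch (assumed API): FlashInfer's single-request decode kernel
# with sliding window and logit softcapping; exact kwargs in v0.1.2 may differ.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 16, 8, 128, 8192
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = flashinfer.single_decode_with_kv_cache(
    q, k, v,
    window_left=4096,      # restrict attention to roughly the last 4k tokens
    logits_soft_cap=50.0,  # Gemma 2-style attention logit capping
)
```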
However, in vLLM's FlashAttention backend I see the following block:
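(Reconstructed from memory as a self-contained stand-in; the helper name here is hypothetical and the exact message may differ from the in-tree code.)

```python
# Approximate stand-in for the guard in vllm/attention/backends/flash_attn.py.
from typing import Optional


def reject_sliding_window(sliding_window: Optional[int]) -> None:
    if sliding_window is not None:
        # The in-tree comment says flash-attn's sliding window does not
        # work with vLLM's paged KV cache, hence the hard error.
        raise ValueError(
            "Sliding window is not supported in FlashAttention.")
```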
Despite FlashAttention itself supporting sliding window, vLLM's FlashAttention wrapper does not. What is the conflict between sliding window and the paged KV cache? Does this limitation mean that sliding window with FlashInfer is also not possible?
Alternatives
FlashAttention was recently updated with logit softcapping, so if vLLM's wrapper were updated to use it and sliding window support were enabled as well, Gemma 2's 8k context would also be achievable via that route.
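For illustration, calling flash-attn directly with both features enabled looks roughly like this (a sketch assuming flash-attn >= 2.6, which documents `window_size` and `softcap` on `flash_attn_func`; this is not vLLM's wrapper):

```python
# Sketch: flash-attn (>= 2.6) with sliding window + logit softcapping enabled.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 8192, 16, 128
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")

out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(4096, 0),  # look back 4k tokens, none ahead
    softcap=50.0,           # Gemma 2-style attention logit capping
)
```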
Additional context
No response