vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: sliding window attention for odd layers #7442

Open tanliboy opened 3 months ago

tanliboy commented 3 months ago

🚀 The feature, motivation and pitch

Thanks for fixing the soft-capping issue with the Gemma 2 models in the last release! I noticed there is still a code comment and a warning when serving Gemma 2 models.

Are there any plans to support sliding window attention for odd layers? Additionally, do we have any benchmarks on the performance impact of not using sliding windows on these layers? Cc @WoosukKwon
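For context, Gemma 2 interleaves local (sliding-window) attention and global attention across layers, while the warning above concerns serving the model without that distinction. A minimal, illustrative sketch of the masking difference follows; the function names are hypothetical, not vLLM's actual API, and which parity uses the sliding window depends on the model config:

```python
from typing import Optional

# Sketch only: build boolean causal attention masks for interleaved
# local/global layers. Names and the even/odd convention are assumptions.

def causal_mask(seq_len: int, window: Optional[int] = None) -> list:
    """mask[q][k] is True where query position q may attend to key k.

    window=None -> full (global) causal attention
    window=W    -> sliding-window attention: k in (q - W, q]
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            ok = k <= q                       # causal constraint
            if window is not None:
                ok = ok and (q - k < window)  # local-window constraint
            row.append(ok)
        mask.append(row)
    return mask

def layer_mask(layer_idx: int, seq_len: int, window: int) -> list:
    # Alternate local and global attention by layer index; the exact
    # parity used by Gemma 2 is taken from the model config in practice.
    if layer_idx % 2 == 0:
        return causal_mask(seq_len, window=window)
    return causal_mask(seq_len)
```

With `seq_len=5, window=2`, the last query position attends only to the two most recent keys under the sliding window, but to all prior keys under global attention.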

Alternatives

No response

Additional context

No response

griff4692 commented 1 month ago

Is this on the roadmap?

This will now also be an issue with the new Ministral models. I see that the warning has been updated to reference the new Mistral models.