vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug] [Speculative Decoding/flash_attn]: Flash attn backend crashes in speculative decoding #5288

Closed: cadedaniel closed this issue 3 weeks ago

cadedaniel commented 3 weeks ago

Your current environment

CI environment

🐛 Describe the bug

See https://github.com/vllm-project/vllm/pull/5286 and https://github.com/vllm-project/vllm/issues/5152

My guess is that the way we encode multiple query tokens per sequence in a single attention kernel invocation somehow violates the flash_attn contract.
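For context, here is a minimal sketch of the pattern I mean, using flash-attn's public `flash_attn_varlen_func`; this is not the actual vLLM backend code, and the shapes/values are made up for illustration. During speculative decoding each sequence contributes several query tokens to one varlen call, so `cu_seqlens_q` advances by more than one per sequence instead of by exactly one as in plain decode:

```python
# Illustrative sketch only (not vLLM's backend code): packing multiple query
# tokens per sequence into a single varlen flash-attn invocation.
import torch
from flash_attn import flash_attn_varlen_func

num_seqs = 4            # sequences in the batch
k_spec = 3              # speculative tokens per sequence -> >1 query token each
ctx_len = 128           # key/value context length per sequence (assumed equal here)
num_heads, head_dim = 8, 64

# Packed queries: k_spec query tokens per sequence, flattened along dim 0.
q = torch.randn(num_seqs * k_spec, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(num_seqs * ctx_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)

# Cumulative sequence lengths. With multiple query tokens per sequence,
# cu_seqlens_q steps by k_spec rather than by 1.
cu_seqlens_q = torch.arange(0, (num_seqs + 1) * k_spec, k_spec,
                            dtype=torch.int32, device="cuda")
cu_seqlens_k = torch.arange(0, (num_seqs + 1) * ctx_len, ctx_len,
                            dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=k_spec,
    max_seqlen_k=ctx_len,
    causal=True,
)
```

If this layout (or the metadata we build around it) doesn't match what the flash_attn kernel expects, that would explain the crash.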

cadedaniel commented 3 weeks ago

Actually, I will close this in favor of https://github.com/vllm-project/vllm/issues/5152.