vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug] [Speculative Decoding/flash_attn]: Flash attn backend crashes in speculative decoding #5288

Closed: cadedaniel closed this issue 3 weeks ago

cadedaniel commented 3 weeks ago

Your current environment

CI environment

🐛 Describe the bug

See https://github.com/vllm-project/vllm/pull/5286 and https://github.com/vllm-project/vllm/issues/5152

My guess is that the way we encode multiple query tokens per sequence in a single attention kernel invocation somehow violates the flash_attn contract.
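For context, here is a minimal sketch of the pattern I mean, using flash-attn's public `flash_attn_varlen_func`; this is not the actual vLLM backend code, and the shapes/values are made up for illustration. During speculative decoding each sequence contributes several query tokens to one varlen call, so `cu_seqlens_q` advances by more than one per sequence instead of by exactly one as in plain decode:

```python
# Illustrative sketch only (not vLLM's backend code): packing multiple query
# tokens per sequence into a single varlen flash-attn invocation.
import torch
from flash_attn import flash_attn_varlen_func

num_seqs = 4            # sequences in the batch
k_spec = 3              # speculative tokens per sequence -> >1 query token each
ctx_len = 128           # key/value context length per sequence (assumed equal here)
num_heads, head_dim = 8, 64

# Packed queries: k_spec query tokens per sequence, flattened along dim 0.
q = torch.randn(num_seqs * k_spec, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(num_seqs * ctx_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)

# Cumulative sequence lengths. With multiple query tokens per sequence,
# cu_seqlens_q steps by k_spec rather than by 1.
cu_seqlens_q = torch.arange(0, (num_seqs + 1) * k_spec, k_spec,
                            dtype=torch.int32, device="cuda")
cu_seqlens_k = torch.arange(0, (num_seqs + 1) * ctx_len, ctx_len,
                            dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=k_spec,
    max_seqlen_k=ctx_len,
    causal=True,
)
```

If this layout (or the metadata we build around it) doesn't match what the flash_attn kernel expects, that would explain the crash.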

cadedaniel commented 3 weeks ago

Actually, I will close this in favor of https://github.com/vllm-project/vllm/issues/5152.