vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016

Open cadedaniel opened 4 months ago

cadedaniel commented 4 months ago

🚀 The feature, motivation and pitch

Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer under the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills. Chunked prefill is recent work in vLLM that mitigates this by spreading prefill work across many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
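
To make the idea concrete, here is a minimal sketch (hypothetical names, not vLLM's actual scheduler interface) of how a per-step planner could split a fixed token budget between a prefill chunk and the number of speculative tokens to verify for the decode sequences:

```python
# Minimal sketch of budget-aware speculation-length selection.
# Assumption: with chunked prefill enabled, each engine step has a fixed token
# budget; whatever is left after scheduling the decode sequences and a prefill
# chunk bounds how many speculative tokens we can afford to verify this step.

def plan_step(token_budget: int,
              num_decode_seqs: int,
              pending_prefill_tokens: int,
              max_spec_len: int) -> tuple[int, int]:
    """Return (prefill_chunk_size, speculation_len) for one engine step.

    Hypothetical helper: every decode sequence needs one slot for its
    bonus/verified token; each extra speculative token costs one more slot
    per decode sequence.
    """
    budget_left = token_budget - num_decode_seqs
    if budget_left < 0:
        return 0, 0  # not enough budget even for plain decoding

    # Spend part of the remaining budget on a prefill chunk...
    prefill_chunk = min(pending_prefill_tokens, budget_left // 2)
    budget_left -= prefill_chunk

    # ...and the rest on speculative tokens, shared across decode sequences.
    spec_len = 0
    if num_decode_seqs > 0:
        spec_len = min(max_spec_len, budget_left // num_decode_seqs)
    return prefill_chunk, spec_len


if __name__ == "__main__":
    # e.g. a 512-token budget, 8 decode sequences, 2000 prefill tokens pending
    print(plan_step(512, 8, 2000, max_spec_len=5))  # -> (252, 5)
```

The split policy here (half the leftover budget to prefill) is arbitrary; the real design question is exactly how to trade prefill progress against speculation length at each step.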

This is a complex task that requires some design; if you're interested, please reach out.

Alternatives

No response

Additional context

cc @LiuXiaoxuanPKU @comaniac @rkooo567

Dbxwz commented 3 months ago

Hello, you mentioned optimizations for scoring time in #4630:

> P1 (Large) Replace CPU-based batch expansion with multi-query attention kernel call

I think the multi-query attention kernel here is not the same as MQA; it is more like the append stage in flashinfer, am I right? I also noticed that the computation in the append stage is similar to a single chunked-prefill step. So I used chunked prefill to implement an AppendTop1Scorer, which gets a 10% speedup compared to BatchExpansionTop1Scorer. It's a dirty solution, since I create a new SequenceGroupMetadata that turns the scoring sequence into a chunked-prefill sequence. This implementation conflicts with recomputation and chunked prefill.

So the ideal implementation would be for ModelRunner and the Backend to support the append stage; the Backend should already support it if it supports chunked prefill.
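
For reference, here is a rough numpy sketch (assumed shapes and names, not vLLM code) of the access pattern I mean: all draft tokens of a sequence are scored in one attention call against the existing KV cache plus each other under a causal mask, the same pattern as a single chunked-prefill step:

```python
import numpy as np

def append_attention(q_new: np.ndarray,    # [k_new, d] queries for the appended (draft) tokens
                     kv_cache: np.ndarray, # [n_ctx, 2, d] cached (K, V) pairs
                     kv_new: np.ndarray    # [k_new, 2, d] (K, V) of the appended tokens
                     ) -> np.ndarray:
    """Attention output for the appended tokens, causal within the new span.

    Toy single-head version for illustration only (no paged KV cache, no
    batching); real kernels such as flashinfer's append stage or a
    chunked-prefill attention call implement the same pattern efficiently.
    """
    k_new, d = q_new.shape
    n_ctx = kv_cache.shape[0]
    keys = np.concatenate([kv_cache[:, 0], kv_new[:, 0]], axis=0)    # [n_ctx + k_new, d]
    values = np.concatenate([kv_cache[:, 1], kv_new[:, 1]], axis=0)

    scores = q_new @ keys.T / np.sqrt(d)                             # [k_new, n_ctx + k_new]
    # Causal mask: appended token i may attend to all cached tokens and to
    # appended tokens 0..i, but not to later ones.
    causal = np.triu(np.ones((k_new, k_new), dtype=bool), k=1)
    scores[:, n_ctx:][causal] = -np.inf

    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ values                                            # [k_new, d]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_ctx, k_new = 16, 32, 4
    out = append_attention(rng.normal(size=(k_new, d)),
                           rng.normal(size=(n_ctx, 2, d)),
                           rng.normal(size=(k_new, 2, d)))
    print(out.shape)  # (4, 16)
```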

In addition, is this issue about solving the scheduling problem of speculative decoding? Could you give a more detailed description of what needs to be done in this issue?

cadedaniel commented 3 months ago

That's awesome. You should chat with @LiuXiaoxuanPKU who is removing batch expansion from vLLM.

FYI, this issue is about combining the inter-token latency (ITL) improvements obtained from chunked prefill scheduling with spec decode.

NickLucche commented 1 week ago

I can look into this