vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: batched parallel decoding #4303

Open snyhlxde1 opened 5 months ago

snyhlxde1 commented 5 months ago

🚀 The feature, motivation and pitch

Parallel/Jacobi decoding improves inference efficiency by breaking the sequential nature of conventional auto-regressive decoding. Recent works [1, 2, 3] have demonstrated opportunities in this direction and the efficiency improvements that parallel decoding can bring.
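
For concreteness, here is a minimal single-sequence sketch of greedy Jacobi decoding. It uses Hugging Face transformers purely for illustration (this is not vLLM's API); the model name, `window_size`, and the window initialization are assumptions for the sketch:

```python
# Minimal sketch of greedy Jacobi (parallel) decoding for a single sequence.
# Illustration only -- not vLLM code. Model, window_size, and the window
# initialization are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def jacobi_step(prefix_ids: torch.Tensor, window: torch.Tensor) -> torch.Tensor:
    """One Jacobi iteration: re-predict every window token in a single forward pass."""
    input_ids = torch.cat([prefix_ids, window], dim=-1)   # [1, P + n]
    logits = model(input_ids).logits                      # [1, P + n, vocab]
    n = window.shape[-1]
    # Window position i is predicted from the logits just before it
    # (the last prefix token or window position i - 1).
    return logits[:, -n - 1:-1, :].argmax(dim=-1)         # [1, n]

@torch.no_grad()
def jacobi_decode(prompt: str, window_size: int = 8, max_iters: int = 32) -> str:
    prefix_ids = tok(prompt, return_tensors="pt").input_ids
    # Initialize the n-token guess window (here: repeat the last prompt token).
    window = prefix_ids[:, -1:].repeat(1, window_size)
    for _ in range(max_iters):
        new_window = jacobi_step(prefix_ids, window)
        if torch.equal(new_window, window):               # fixed point reached
            break
        window = new_window
    return tok.decode(torch.cat([prefix_ids, window], dim=-1)[0])

print(jacobi_decode("The capital of France is"))
```

Each iteration is one forward pass over the whole window; at the fixed point the window equals the greedy autoregressive output, and several tokens may become correct in a single iteration.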

Our team (@nanjiangwill, @Viol2000, @zhisbug) is interested in implementing this feature and supporting batched parallel decoding for efficient serving on vLLM.

This could be a complementary feature to the speculative decoding features the vLLM team is supporting:

  1. In the high request-rate regime, using draft models/tree-based verification can introduce additional overhead that hurts serving latency.
  2. In such cases, batched parallel decoding can be adopted to bring a consistent speedup without the need for draft models.
  3. It offers an alternative to speculative decoding depending on users' needs. Future implementations could also potentially bring the two together.
  4. It would expedite research efforts in this general direction.

Alternatives

No response

Additional context

No response

cadedaniel commented 5 months ago

Would be awesome to have this in vLLM! I'm happy to discuss with y'all ways this could leverage the existing spec decode framework (if at all).

zhisbug commented 5 months ago

@snyhlxde1 could you specify what exactly you want to implement? This issue is very vague.

snyhlxde1 commented 5 months ago

@zhisbug Current open-source projects don't support batched Jacobi decoding, where each query converges at a varying rate and potentially with a different window size/n-token sequence length.

Implementation would involve a new worker and possibly changes to the scheduler and more. Item-by-item details haven't been discussed yet, so they aren't documented here.
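
To make the "varying convergence rates / different window sizes" point concrete, here is a minimal sketch of the per-request bookkeeping such a worker might need. The names (`JacobiRequest`, `refine_windows`, `jacobi_serving_step`) are hypothetical, not vLLM APIs, and the batched forward pass is stubbed out:

```python
# Hypothetical per-request bookkeeping for batched Jacobi decoding.
# Not vLLM code; the batched forward pass is a stub.
from dataclasses import dataclass, field

@dataclass
class JacobiRequest:
    request_id: str
    prompt_ids: list[int]
    window: list[int]                 # current n-token guess; size may differ per request
    accepted_ids: list[int] = field(default_factory=list)
    converged: bool = False

def refine_windows(batch: list[JacobiRequest]) -> list[list[int]]:
    """Stub for one batched forward pass that re-predicts every window token.
    A real implementation would pack all (prompt + window) tokens into a single
    model call; here we just return the current windows so the demo terminates."""
    return [req.window[:] for req in batch]

def jacobi_serving_step(batch: list[JacobiRequest]) -> None:
    """One scheduler iteration: refine all active windows, then check each
    request for its own fixed point (requests converge at different rates)."""
    active = [req for req in batch if not req.converged]
    if not active:
        return
    new_windows = refine_windows(active)
    for req, new_window in zip(active, new_windows):
        if new_window == req.window:      # per-request fixed point
            req.accepted_ids.extend(new_window)
            req.converged = True
        else:
            req.window = new_window       # keep iterating in the next step

# Example: two requests with different window sizes in the same batch.
batch = [
    JacobiRequest("a", prompt_ids=[1, 2, 3], window=[0] * 4),
    JacobiRequest("b", prompt_ids=[4, 5], window=[0] * 8),
]
jacobi_serving_step(batch)
print([req.converged for req in batch])
```

The key point is that convergence is checked per request, so finished requests can leave the batch (or slide their window forward) while others keep iterating, which is what the scheduler changes would need to support.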

zhisbug commented 5 months ago

I don't see this having high value for vLLM users. Thoughts? @zhuohan123 @WoosukKwon

cadedaniel commented 5 months ago

For extremely low-latency use cases it is valuable to vLLM users (where allocating FLOPs to Jacobi solving can outperform other speculative methods).

I am not sure at what point this can beat, e.g., fine-tuned Medusa or EAGLE heads, wdyt?

AaronFriel commented 3 months ago

@zhisbug why do you think this has low value for vLLM users? The recent papers seem promising, and implementing this technique in vLLM would show whether those developments hold up in an open-source, testable environment.

If nothing else, even if this implementation doesn't improve upon other methods, having more varied speculative decoding mechanisms with different tradeoffs in vLLM helps users and other researchers understand where the current Pareto frontier lies.

zhisbug commented 3 months ago

In general, vLLM is a framework primarily addressing online, high-throughput LLM serving with large batch sizes.

Those parallel decoding methods are normally only beneficial at small batch sizes (e.g., batch size = 1).

Also, there is a non-trivial effort required to integrate these parallel decoding methods into the current architecture of vLLM, given the differing nature of these two use cases. We might need to do some substantial refactoring in vLLM to allow a special path.

There is another ongoing thread (https://github.com/vllm-project/vllm/issues/4565) looking into this, but it is still unclear how to do that. Careful design is needed.

That being said, my point is: