vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Combine pipeline parallelism with speculative decoding #6911

Open cadedaniel opened 1 month ago

cadedaniel commented 1 month ago

🚀 The feature, motivation and pitch

We can combine pipeline parallelism with speculative decoding to get latency reductions, especially when serving Llama 405b over two nodes.
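
For context, a minimal sketch of the engine arguments one would expect to combine (argument names follow the vLLM engine args of this period, and the model names are illustrative; whether this pairing actually initializes and runs is exactly what this issue tracks):

```python
from vllm import LLM

# Hypothetical combination of the two features -- the point of this issue is
# that pipeline parallelism + speculative decoding is not yet verified to work.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,        # shard layers within each node
    pipeline_parallel_size=2,      # split stages across the two nodes
    distributed_executor_backend="ray",
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative draft model
    num_speculative_tokens=5,
)
```

Pipeline parallelism keeps cross-node traffic down to the activations at the stage boundary, which is why it pairs naturally with two-node 405B serving.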

The speculative decoding framework was designed to support pipeline parallelism by wrapping the normal workers inside the speculative decode worker, but this combination still needs to be tried and any remaining issues ironed out.
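
As a rough illustration of that wrapping idea (the class and method names below are hypothetical and simplified, not vLLM's actual worker API): the spec-decode worker owns a draft worker plus the normal target workers, so pipeline parallelism inside the target workers should, in principle, be invisible to the speculative-decoding logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineStageWorker:
    """One pipeline-parallel stage of the target model (a "normal" worker)."""
    stage_id: int
    num_stages: int

    def execute(self, tokens: List[int]) -> List[int]:
        # A real stage would run its shard of layers and forward activations
        # to the next stage; here we just pass the tokens through.
        return tokens


@dataclass
class DraftWorker:
    """Small draft model that proposes speculative tokens."""
    num_speculative_tokens: int = 4

    def propose(self, tokens: List[int]) -> List[int]:
        # Placeholder proposal step; in practice this runs the draft model.
        return [0] * self.num_speculative_tokens


@dataclass
class SpecDecodeWrapper:
    """Speculative-decode worker that wraps the normal target workers.

    Because the wrapper only calls execute-style entry points on the target
    workers, pipeline parallelism inside them stays opaque to the
    speculative-decoding control flow.
    """
    draft: DraftWorker
    target_stages: List[PipelineStageWorker]

    def step(self, tokens: List[int]) -> List[int]:
        proposals = self.draft.propose(tokens)
        # Score the proposed tokens by running them through every pipeline
        # stage of the target model in order.
        scored = tokens + proposals
        for stage in self.target_stages:
            scored = stage.execute(scored)
        # A real implementation would verify and accept/reject proposals here.
        return scored


if __name__ == "__main__":
    worker = SpecDecodeWrapper(
        draft=DraftWorker(num_speculative_tokens=4),
        target_stages=[PipelineStageWorker(i, 2) for i in range(2)],
    )
    print(worker.step([1, 2, 3]))
```

The design point this tries to capture: adding pipeline stages changes what happens inside the target workers' execute calls, not the propose/score/verify control flow wrapped around them.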

Alternatives

No response

Additional context

No response

jiqing-feng commented 3 weeks ago

Hi @cadedaniel. Does the pipeline here mean splitting the draft model and the target model, i.e. two stages where stage 1 runs only the draft model and stage 2 runs only the target model?