Hi @cadedaniel. Does "pipeline" here mean splitting the draft model and the target model? For example, with 2 stages, stage 1 runs only the draft model and stage 2 runs only the target model?
🚀 The feature, motivation and pitch
We can combine pipeline parallelism with speculative decoding to reduce latency, especially when serving Llama 405B across two nodes.
The speculative decoding framework is designed to support pipeline parallelism by wrapping the normal workers inside the speculative decode worker, but this combination has not yet been exercised in practice, so the remaining issues need to be ironed out.
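A minimal sketch of the wrapping design described above, to make the discussion concrete. All class and method names here (`SpecDecodeWorker`, `StageWorker`, `DraftModel`, etc.) are hypothetical stand-ins, not vLLM's actual API: the point is only that the speculative decode worker composes around the per-stage pipeline-parallel workers of the target model rather than replacing them.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StageWorker:
    """One pipeline-parallel stage of the target model (hypothetical)."""
    stage_id: int

    def forward(self, hidden: List[int]) -> List[int]:
        # Stand-in for running this stage's transformer layers.
        return list(hidden)

@dataclass
class DraftModel:
    """Small draft model proposing k tokens per step (hypothetical)."""
    def propose(self, prefix: List[int], k: int) -> List[int]:
        # Toy proposal rule: continue the sequence by +1 increments.
        return [prefix[-1] + i + 1 for i in range(k)]

@dataclass
class SpecDecodeWorker:
    """Wraps the normal pipeline-stage workers, per the design above."""
    draft: DraftModel
    stages: List[StageWorker]

    def verify(self, prefix: List[int], proposal: List[int]) -> List[int]:
        # Run the full target pipeline once over prefix + proposal,
        # then accept proposed tokens greedily while they match the
        # target's (toy) prediction of "previous token + 1".
        hidden = prefix + proposal
        for stage in self.stages:
            hidden = stage.forward(hidden)
        accepted: List[int] = []
        for tok in proposal:
            if tok == (prefix + accepted)[-1] + 1:
                accepted.append(tok)
            else:
                break
        return accepted

    def step(self, prefix: List[int], k: int = 3) -> List[int]:
        proposal = self.draft.propose(prefix, k)
        return prefix + self.verify(prefix, proposal)

# Two target-model stages, e.g. one per node when serving over two nodes.
worker = SpecDecodeWorker(DraftModel(), [StageWorker(0), StageWorker(1)])
print(worker.step([1, 2, 3]))  # → [1, 2, 3, 4, 5, 6]
```

Because the pipeline stages are driven only through their normal `forward` interface, the wrapper does not need to know how many stages exist or where they are placed, which is what should let speculative decoding and pipeline parallelism compose once the integration issues are worked out.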
Alternatives
No response
Additional context
No response