vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561

Open cadedaniel opened 2 weeks ago

cadedaniel commented 2 weeks ago

Background

Speculative decoding leverages the ability to cheaply generate proposals and cheaply verify them to achieve speedup for memory-bound inference. Different speculative decoding methods explore the trade-off frontier between proposal cost, alignment with the target model, and verification cost.

For example, Medusa produces very cheap proposals, but the quality of its proposals is strictly lower than Eagle's because the Medusa heads do not have access to the previous proposals. Eagle, on the other hand, pays more for its proposals by sampling autoregressively instead of in a single shot, but in exchange it produces higher-quality proposals.

At the end of the day, what the user cares about dictates which speculative technique is used. vLLM's job is to give them the option that yields the best speedup for their use case.

Draft-model, EAGLE, and MLPSpeculator proposers rely on autoregressive proposals. This means their top-1 proposals are higher quality than Medusa's, which gives vLLM an inter-token latency (ITL) reduction that is more FLOPs-efficient than Medusa. This is what our speculative decoding efforts are focused on first -- afterward, we can support top-k proposals with Medusa so users who care more about ITL reduction can use vLLM.

Speeding up autoregressive proposal methods

This issue is to speed up autoregressive proposal methods by optimizing the sampler. Specifically, the sampler performs wasted work by copying sampled values to the CPU and serializing them into Python objects. In speculative decoding, we never use the Python objects because we consume the raw sampled token ids / probabilities directly from their GPU tensors. This means the copy and CPU serialization are pure overhead in speculative decoding.
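To make this concrete, here is a rough sketch of the pattern in question (tensor names and shapes are illustrative, not vLLM's actual sampler code): the sampled token ids already live on the GPU, and the device-to-host copy plus Python object construction afterward is the part spec decode never needs.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins for the tensors the sampler already has on-device after sampling.
logits = torch.randn(8, 32000, device=device)                # [batch, vocab]
probs = torch.softmax(logits, dim=-1)
sampled_token_ids = torch.multinomial(probs, num_samples=1)  # stays on GPU

# Today's path (pure overhead for spec decode): device-to-host copy followed by
# serialization into per-sequence Python objects.
token_ids_cpu = sampled_token_ids.cpu().tolist()             # blocking D2H copy
python_outputs = [{"token_id": ids[0]} for ids in token_ids_cpu]

# What spec decode actually consumes: the raw tensors, no copy, no Python objects.
draft_token_ids = sampled_token_ids
draft_probs = probs
```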

How much overhead?

In profiling vLLM, I found that the copy + serialization in the draft model takes ~441µs (cell J30). The actual forward pass and sampling math of the draft model take (220µs + 639µs) = 859µs. This means that by removing the unnecessary copy and serialization, we can generate ~50% more draft tokens in the time it currently takes with the copy and serialization enabled.
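As a quick sanity check on the 50% figure, using the numbers above:

```python
# Per draft-model step, in microseconds, from the profile above.
forward_us = 220
sampling_math_us = 639
copy_and_serialize_us = 441

useful_us = forward_us + sampling_math_us        # 859 µs of real work
total_us = useful_us + copy_and_serialize_us     # 1300 µs including the overhead

print(total_us / useful_us)                      # ~1.51 -> ~50% more draft tokens per unit time
```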

This difference has a massive impact on the overall performance of speculative decoding.

Furthermore, the subsequent draft-model forward pass must consume the output of the previous step; keeping that output on the GPU lets us reduce time spent in prepare_inputs. I don't have numbers here, but I expect a further ~150µs reduction per draft-model step from this (~300µs to ~150µs).

The work

This issue is to:

  1. Make the CPU copy and CPU serialization optional in vLLM's sampler (thus leaving sampled token ids on GPU), and then
  2. Pass those sampled token ids to prepare_inputs of the next draft-model forward pass.

1. Make CPU serialization optional

Warm-up task: a good warm-up task to get familiar with the Sampler is to add an option to disable logprobs for a given Worker. This will also provide some speedup to spec decode (~2ms of e2e step time), but it isn't part of this issue.
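To sketch the shape of item 1 (everything below is hypothetical; the flag name and return type are not vLLM's actual Sampler API), the idea is a switch that returns GPU tensors and skips the host copy entirely:

```python
from dataclasses import dataclass

import torch


@dataclass
class OnDeviceSamplerOutput:
    # Hypothetical container for the on-GPU results spec decode consumes.
    sampled_token_ids: torch.Tensor  # [batch, 1], stays on GPU
    sampled_probs: torch.Tensor      # [batch, vocab], stays on GPU


def sample(logits: torch.Tensor, skip_cpu_serialization: bool = False):
    """Hypothetical sampler entry point.

    With skip_cpu_serialization=True, skip the device-to-host copy and the
    Python object construction entirely and hand back GPU tensors.
    """
    probs = torch.softmax(logits, dim=-1)
    token_ids = torch.multinomial(probs, num_samples=1)

    if skip_cpu_serialization:
        return OnDeviceSamplerOutput(sampled_token_ids=token_ids,
                                     sampled_probs=probs)

    # Existing path: the ~441µs of copy + serialization measured above.
    return [{"token_id": tid} for tid in token_ids.cpu().flatten().tolist()]
```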

Code pointers:

2. Allow the prepare_inputs method to work on-device

The on-GPU sampled token ids should be appended to the next prepare_inputs batch.
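A rough sketch of that, assuming prepare_inputs builds flat per-sequence input_ids/positions tensors (the function and argument names below are made up for illustration):

```python
import torch


def prepare_decode_inputs_on_device(sampled_token_ids: torch.Tensor,
                                    prev_positions: torch.Tensor):
    """Build the next draft-model decode batch without leaving the GPU.

    sampled_token_ids: [batch, 1] token ids produced by the previous draft step.
    prev_positions:    [batch] position of the last token of each sequence.
    """
    next_input_ids = sampled_token_ids.flatten()  # each sequence decodes its new token
    next_positions = prev_positions + 1           # advance every sequence by one

    # Both tensors stay on-device, so no host round-trip sits between draft steps.
    return next_input_ids, next_positions
```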

cadedaniel commented 2 weeks ago

@Yard1 has a good point: the north star here is being able to CUDA-graph the proposal method and use an on-device mechanism to prepare inputs for the next forward pass. We should get there iteratively, since it should also work for other proposal types (e.g. EAGLE, which uses a different flavor of prepare_inputs).
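For reference, a bare-bones sketch of capturing a draft step with torch.cuda.CUDAGraph (the draft_step callable and the static buffers are assumptions, not existing vLLM code); it only works if prepare_inputs reduces to on-device copies into static buffers:

```python
import torch


@torch.inference_mode()
def capture_draft_step(draft_step, static_input_ids: torch.Tensor):
    """Capture one on-device draft decode step into a CUDA graph.

    draft_step: callable mapping a GPU input_ids buffer to GPU sampled token ids.
    static_input_ids: preallocated GPU buffer overwritten before every replay.
    """
    # Warm up on a side stream so lazy initialization happens outside capture.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        static_output = draft_step(static_input_ids)
    torch.cuda.current_stream().wait_stream(side_stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = draft_step(static_input_ids)

    def replay(new_input_ids: torch.Tensor) -> torch.Tensor:
        # Replay only needs on-device copies into the static input buffer,
        # which is why prepare_inputs has to work on-device as well.
        static_input_ids.copy_(new_input_ids)
        graph.replay()
        return static_output

    return replay
```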

alugowski commented 2 weeks ago

Thanks for writing it up! I'll get started on this.

comaniac commented 2 weeks ago

I'll be working on (2).