vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance] [Speculative decoding]: Compute prepare inputs of the scoring model on GPU #6915

Open · cadedaniel opened this issue 1 month ago

cadedaniel commented 1 month ago

Proposal to improve performance

TL;DR: Move the prepare-inputs step of speculative decoding scoring to the GPU, so that a CPU synchronization can be skipped.

Currently, speculative decoding copies the proposal tokens to the CPU, where the spec decode framework then creates an ExecuteModelRequest for the target model to use in scoring. This synchronization (the device-to-CPU copy) takes ~1ms. When batch expansion is used (it is currently always used), another 0.5-1ms is spent in Python processing. Lastly, the ExecuteModelRequest is passed to the scoring model, which runs prepare inputs (~300µs). End-to-end, this optimization should shave off ~1.3ms for MQA scoring and ~2ms for batch expansion.
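For illustration, here is a minimal sketch (not the actual vLLM code paths) contrasting the two flows: the current path materializes the proposal tokens on the CPU with a blocking copy before building the scoring request in Python, while the proposed path builds the flattened scoring inputs with tensor ops so nothing leaves the GPU. The function names, tensor names, and shapes below are assumptions made for the example.

```python
import torch

def score_inputs_via_cpu(proposal_token_ids: torch.Tensor,
                         prev_seq_lens: torch.Tensor):
    """Rough shape of the current flow: .tolist() forces a blocking
    device-to-CPU copy (~1ms), then Python loops build the per-sequence
    scoring request (batch expansion adds another 0.5-1ms here)."""
    ids_cpu = proposal_token_ids.tolist()   # blocking D2H sync
    lens_cpu = prev_seq_lens.tolist()
    expanded = []
    for seq_tokens, seq_len in zip(ids_cpu, lens_cpu):
        # Batch expansion: one scoring row per proposal token.
        for i, tok in enumerate(seq_tokens):
            expanded.append((tok, seq_len + i))
    return expanded                          # consumed by CPU-side prepare inputs

def score_inputs_on_gpu(proposal_token_ids: torch.Tensor,
                        prev_seq_lens: torch.Tensor):
    """Rough shape of the proposed flow: compute the flattened token ids and
    positions with tensor ops, so no synchronization is needed."""
    batch, k = proposal_token_ids.shape
    positions = prev_seq_lens.unsqueeze(1) + torch.arange(
        k, device=proposal_token_ids.device)   # (batch, k) target positions
    flat_tokens = proposal_token_ids.reshape(-1)     # stays on GPU
    flat_positions = positions.reshape(-1)
    return flat_tokens, flat_positions        # consumed by GPU-side prepare inputs
```

Skipping the `.tolist()` calls removes both the forced synchronization and the Python-side loop, which is where the ~1.3-2ms quoted above is spent.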

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response

cadedaniel commented 1 month ago

@sroy745 is taking an initial look. cc @alexm-neuralmagic @comaniac as an FYI; we will start with your kernel :)