Proposal to improve performance
TL;DR: Move speculative decoding scoring prepare inputs to GPU, so a CPU synchronization can be skipped.

Currently, speculative decoding copies proposal tokens to CPU, where the spec decode framework then creates an ExecuteModelRequest for the target model to use in scoring. This synchronization (copying to CPU) takes ~1ms. When batch expansion is used (and it is currently always used), another 0.5-1ms is spent in Python processing. Lastly, the ExecuteModelRequest is given to the scoring model, which runs prepare inputs (~300µs). End-to-end, this optimization should shave off ~1.3ms for MQA scoring and ~2ms for batch expansion.
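To make the idea concrete, here is a minimal sketch of the GPU-side input construction. This is not vLLM's actual API; the function name, arguments, and tensor layout are hypothetical and only illustrate the technique of appending proposal tokens with tensor ops on-device, instead of the current `.cpu()` round trip followed by building the ExecuteModelRequest in Python.

```python
# Hypothetical sketch: build the scoring model's token inputs entirely on GPU.
# Names and shapes are illustrative, not vLLM's real interfaces.
import torch

def build_scoring_inputs_gpu(
    context_token_ids: torch.Tensor,   # [batch, max_context_len], on GPU
    context_lens: torch.Tensor,        # [batch], int64, on GPU
    proposal_token_ids: torch.Tensor,  # [batch, k] draft proposals, on GPU
) -> tuple[torch.Tensor, torch.Tensor]:
    """Append k proposal tokens per sequence without a device-to-host copy.

    Returns the expanded token matrix and updated sequence lengths, both
    still on GPU and ready for the target model's prepare-inputs step.
    """
    batch, k = proposal_token_ids.shape
    max_len = context_token_ids.shape[1]
    # Allocate room for the k speculative tokens per sequence.
    expanded = torch.zeros(
        batch, max_len + k,
        dtype=context_token_ids.dtype,
        device=context_token_ids.device,
    )
    expanded[:, :max_len] = context_token_ids
    # Scatter each sequence's proposals at positions [len, len + k).
    positions = context_lens.unsqueeze(1) + torch.arange(
        k, device=context_lens.device
    ).unsqueeze(0)                       # [batch, k]
    expanded.scatter_(1, positions, proposal_token_ids)
    return expanded, context_lens + k
```

Because every step above is a device-side tensor op, nothing forces the host to wait on the GPU; the ~1ms synchronization and the Python-side request assembly both drop out of the critical path.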
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response