vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance] [Speculative decoding]: Compute prepare inputs of the scoring model on GPU #6915

Open · cadedaniel opened this issue 1 month ago

cadedaniel commented 1 month ago

Proposal to improve performance

TL;DR: Move the prepare-inputs step of speculative decoding scoring to the GPU, so that a CPU synchronization can be skipped.

Currently, speculative decoding copies the proposal tokens to the CPU, where the spec decode framework then creates an ExecuteModelRequest for the target model to use in scoring. This synchronization (the device-to-CPU copy) takes ~1ms. When batch expansion is used (it is currently always used), another 0.5-1ms is spent in Python processing. Lastly, the ExecuteModelRequest is passed to the scoring model, which runs prepare inputs (~300µs). End-to-end, this optimization should shave off ~1.3ms for MQA scoring and ~2ms for batch expansion.
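For illustration, here is a minimal sketch (not the actual vLLM code paths) contrasting the two flows: the current path materializes the proposal tokens on the CPU with a blocking copy before building the scoring request in Python, while the proposed path builds the flattened scoring inputs with tensor ops so nothing leaves the GPU. The function names, tensor names, and shapes below are assumptions made for the example.

```python
import torch

def score_inputs_via_cpu(proposal_token_ids: torch.Tensor,
                         prev_seq_lens: torch.Tensor):
    """Rough shape of the current flow: .tolist() forces a blocking
    device-to-CPU copy (~1ms), then Python loops build the per-sequence
    scoring request (batch expansion adds another 0.5-1ms here)."""
    ids_cpu = proposal_token_ids.tolist()   # blocking D2H sync
    lens_cpu = prev_seq_lens.tolist()
    expanded = []
    for seq_tokens, seq_len in zip(ids_cpu, lens_cpu):
        # Batch expansion: one scoring row per proposal token.
        for i, tok in enumerate(seq_tokens):
            expanded.append((tok, seq_len + i))
    return expanded                          # consumed by CPU-side prepare inputs

def score_inputs_on_gpu(proposal_token_ids: torch.Tensor,
                        prev_seq_lens: torch.Tensor):
    """Rough shape of the proposed flow: compute the flattened token ids and
    positions with tensor ops, so no synchronization is needed."""
    batch, k = proposal_token_ids.shape
    positions = prev_seq_lens.unsqueeze(1) + torch.arange(
        k, device=proposal_token_ids.device)   # (batch, k) target positions
    flat_tokens = proposal_token_ids.reshape(-1)     # stays on GPU
    flat_positions = positions.reshape(-1)
    return flat_tokens, flat_positions        # consumed by GPU-side prepare inputs
```

Skipping the `.tolist()` calls removes both the forced synchronization and the Python-side loop, which is where the ~1.3-2ms quoted above is spent.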

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response

cadedaniel commented 1 month ago

@sroy745 is taking an initial look. cc @alexm-neuralmagic @comaniac as an FYI; we will start with your kernel :)