vllm-project / vllm

[RFC]: Asynchronous Output Processor #6913

Open megha95 opened 1 month ago

megha95 commented 1 month ago

Motivation.

Each decoding step inside LLMEngine does the following: it schedules the sequences to be executed in the next iteration, executes the model, and processes the model outputs. The GPU sits largely idle while _process_model_outputs() executes on the CPU inside LLMEngine.
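
For reference, here is a minimal sketch of that synchronous loop; the names (`scheduler`, `model_executor`, `process_outputs`) are illustrative stand-ins, not vLLM's exact internals:

```python
def run_engine(scheduler, model_executor, process_outputs):
    """Each step: schedule -> forward pass (GPU) -> process outputs (CPU)."""
    while scheduler.has_unfinished_requests():
        scheduled = scheduler.schedule()                    # pick sequences for this step
        outputs = model_executor.execute_model(scheduled)   # GPU forward pass + sampling
        # De-tokenization, stop checks, and building response objects all run
        # on the CPU here; nothing is queued on the GPU until the next iteration.
        yield process_outputs(outputs)
```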

We ran detailed profiling with Llama3-70B-Instruct on 4xH100 (FP8 dynamic scaling) and measured the time taken by this function inside LLMEngine.

Case study:
- Model: Llama3-70B
- Hardware: 4xH100
- Dtype: FP8 dynamic scaling
- Workload: 128 input tokens, varying batch sizes

| Batch size | TPOT (ms) | Time taken by _process_model_outputs (ms) | Expected reduction in TPOT |
|---|---|---|---|
| 1 | 19 | 0.2 | 0.6% |
| 32 | 25 | 3.4 | 2.1% |
| 1024 | 19 | 0.2 | 20% |

[Note: I'm generating new numbers using the latest main branch. Will update soon.] The numbers above are conservative estimates of the performance improvement.

Proposed Change.

Currently, the sampler just waits on CPU-GPU synchronization. This proposal takes advantage of that idle time by processing model outputs asynchronously while the forward pass is being executed on the GPU.

High-level changes

Introduce a new RPC server + client inside AsyncLLMEngine to run model output processing in a separate process. This server will exclusively perform all the ops inside _process_model_outputs asynchronously, for example: de-tokenization, checking stopping criteria, and pythonizing objects to be sent back to the API server.
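
As a rough, hypothetical sketch of the separate-process idea, the snippet below uses multiprocessing queues in place of the actual RPC transport; all names (`AsyncOutputClient`, `_output_processor_loop`, `_postprocess`) are illustrative, not the proposed vLLM interfaces:

```python
import multiprocessing as mp


def _postprocess(raw_output):
    # Placeholder for the real work: de-tokenization, stop-criteria checks,
    # and pythonizing the result objects.
    return raw_output


def _output_processor_loop(request_q, result_q):
    """Runs in a separate process, analogous to the proposed RPC server."""
    while True:
        raw_outputs = request_q.get()
        if raw_outputs is None:          # sentinel: engine is shutting down
            break
        result_q.put([_postprocess(o) for o in raw_outputs])


class AsyncOutputClient:
    """Engine-side handle (the "client"): submit raw outputs, collect them later."""

    def __init__(self):
        self.request_q = mp.Queue()
        self.result_q = mp.Queue()
        self.proc = mp.Process(
            target=_output_processor_loop,
            args=(self.request_q, self.result_q),
            daemon=True,
        )
        self.proc.start()

    def submit(self, raw_outputs):
        self.request_q.put(raw_outputs)   # returns immediately; the engine moves on

    def collect(self):
        return self.result_q.get()        # processed outputs from an earlier step

    def shutdown(self):
        self.request_q.put(None)
        self.proc.join()
```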

Changes to AsyncLLMEngine

Changes to Scheduler

With this change, the flow of execution inside every decoding step t overlaps output processing with the GPU forward pass.
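
One way to picture that overlap, as a hedged sketch reusing the hypothetical AsyncOutputClient above (the exact ordering in the RFC may differ):

```python
def engine_step(scheduler, model_executor, output_client, prev_raw_outputs):
    """One decoding step t, with step t-1's output processing overlapped."""
    scheduled = scheduler.schedule()

    # Hand step t-1's raw outputs to the processor process, then launch this
    # step's forward pass right away instead of waiting for the CPU work.
    if prev_raw_outputs is not None:
        output_client.submit(prev_raw_outputs)

    raw_outputs = model_executor.execute_model(scheduled)  # GPU forward pass

    # By the time the forward pass returns, step t-1's outputs have usually
    # been processed. They arrive one step late, which is presumably why the
    # Scheduler also needs changes (e.g. for stop-criteria handling).
    processed_prev = (
        output_client.collect() if prev_raw_outputs is not None else None
    )
    return raw_outputs, processed_prev
```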

Some things to note about this implementation:

Feedback Period.

No response

CC List.

cc: @WoosukKwon @zhuohan123 also cc @SolitaryThinker for feedback on how this will interact with multi-step scheduling

Any Other Things.

Thanks to @WoosukKwon for sharing feedback on design docs.

zhouyuan commented 1 month ago

+1. The issue is even worse on small models: we see output processing occupy nearly ~15% of execution time on small models with the CPU backend.

thanks, -yuan