vllm-project / vllm

[RFC]: Asynchronous Output Processor #6913

Open megha95 opened 1 month ago

megha95 commented 1 month ago

Motivation.

Each decoding step inside LLMEngine does the following: it schedules the sequences to be executed in the next iteration, executes the model, and processes the model outputs. The GPU sits largely idle while _process_model_outputs() executes on the CPU inside LLMEngine.
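
For reference, here is a minimal sketch of that synchronous loop; the names (`scheduler`, `model_executor`, `process_outputs`) are illustrative stand-ins, not vLLM's exact internals:

```python
def run_engine(scheduler, model_executor, process_outputs):
    """Each step: schedule -> forward pass (GPU) -> process outputs (CPU)."""
    while scheduler.has_unfinished_requests():
        scheduled = scheduler.schedule()                    # pick sequences for this step
        outputs = model_executor.execute_model(scheduled)   # GPU forward pass + sampling
        # De-tokenization, stop checks, and building response objects all run
        # on the CPU here; nothing is queued on the GPU until the next iteration.
        yield process_outputs(outputs)
```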

We ran detailed profiling with Llama3-70B-Instruct on 4xH100 (FP8 dynamic scaling) and measured the time taken by this function inside LLMEngine.

Case study:
- Model: Llama3-70B
- Hardware: 4xH100
- Dtype: FP8 dynamic scaling
- Workload: 128 input tokens, varying batch sizes

| Batch size | TPOT (ms) | Time taken by _process_model_outputs (ms) | Expected reduction in TPOT |
|---|---|---|---|
| 1 | 19 | 0.2 | 0.6% |
| 32 | 25 | 3.4 | 2.1% |
| 1024 | 19 | 0.2 | 20% |

[Note: I'm generating new numbers using the latest main branch. Will update soon.] The numbers above are conservative estimates of the performance improvement.

Proposed Change.

Currently, the sampler just waits on CPU-GPU synchronization. This proposal takes advantage of that idle time by processing model outputs asynchronously while the forward pass is being executed on the GPU.

High-level changes

Introduce a new RPC server + client inside AsyncLLMEngine to run model output processing in a separate process. This server will exclusively perform all the ops inside _process_model_outputs asynchronously, for example: de-tokenization, checking stopping criteria, and pythonizing objects to be sent back to the API server.
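
As a rough, hypothetical sketch of the separate-process idea, the snippet below uses multiprocessing queues in place of the actual RPC transport; all names (`AsyncOutputClient`, `_output_processor_loop`, `_postprocess`) are illustrative, not the proposed vLLM interfaces:

```python
import multiprocessing as mp


def _postprocess(raw_output):
    # Placeholder for the real work: de-tokenization, stop-criteria checks,
    # and pythonizing the result objects.
    return raw_output


def _output_processor_loop(request_q, result_q):
    """Runs in a separate process, analogous to the proposed RPC server."""
    while True:
        raw_outputs = request_q.get()
        if raw_outputs is None:          # sentinel: engine is shutting down
            break
        result_q.put([_postprocess(o) for o in raw_outputs])


class AsyncOutputClient:
    """Engine-side handle (the "client"): submit raw outputs, collect them later."""

    def __init__(self):
        self.request_q = mp.Queue()
        self.result_q = mp.Queue()
        self.proc = mp.Process(
            target=_output_processor_loop,
            args=(self.request_q, self.result_q),
            daemon=True,
        )
        self.proc.start()

    def submit(self, raw_outputs):
        self.request_q.put(raw_outputs)   # returns immediately; the engine moves on

    def collect(self):
        return self.result_q.get()        # processed outputs from an earlier step

    def shutdown(self):
        self.request_q.put(None)
        self.proc.join()
```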

Changes to AsyncLLMEngine

Changes to Scheduler

With this change, the flow of execution inside every decoding step t overlaps output processing with the GPU forward pass.
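
One way to picture that overlap, as a hedged sketch reusing the hypothetical AsyncOutputClient above (the exact ordering in the RFC may differ):

```python
def engine_step(scheduler, model_executor, output_client, prev_raw_outputs):
    """One decoding step t, with step t-1's output processing overlapped."""
    scheduled = scheduler.schedule()

    # Hand step t-1's raw outputs to the processor process, then launch this
    # step's forward pass right away instead of waiting for the CPU work.
    if prev_raw_outputs is not None:
        output_client.submit(prev_raw_outputs)

    raw_outputs = model_executor.execute_model(scheduled)  # GPU forward pass

    # By the time the forward pass returns, step t-1's outputs have usually
    # been processed. They arrive one step late, which is presumably why the
    # Scheduler also needs changes (e.g. for stop-criteria handling).
    processed_prev = (
        output_client.collect() if prev_raw_outputs is not None else None
    )
    return raw_outputs, processed_prev
```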

Some things to note about this implementation:

Feedback Period.

No response

CC List.

cc: @WoosukKwon @zhuohan123 also cc @SolitaryThinker for feedback on how this will interact with multi-step scheduling

Any Other Things.

Thanks to @WoosukKwon for sharing feedback on design docs.

zhouyuan commented 1 month ago

+1. The issue is even worse on small models: we see output processing occupy nearly ~15% of execution time on small models with the CPU backend.

thanks, -yuan