
vLLM's V1 Engine Architecture #8779

Open simon-mo opened 1 month ago

simon-mo commented 1 month ago

This issue describes the high-level directions for creating the LLM Engine V1. We want the design to be as transparent as possible, and we created this issue to track progress and solicit feedback.

Goal:

Non-goals (the following are important but orthogonal):

The scope is exclusively the scheduler, memory manager, and distributed architecture. We will not touch APIs, models, kernels, or most parts of the model runner.

Highlights of the new design:

Lessons we learned from V1:

Timeline-wise, we plan to execute the changes incrementally. Over time we will add PRs and issues related to the new architecture here.

The design is led by the vLLM maintainers @WoosukKwon @zhuohan123 @youkaichao @simon-mo @LiuXiaoxuanPKU @comaniac @alexm-neuralmagic @njhill @robertgshaw2-neuralmagic @rkooo567 and many others!

youkaichao commented 1 month ago

I want to highlight that the re-architecture will only affect vLLM developers who need to change vLLM's code, and in a positive way that makes their lives easier. For users who use vLLM directly, there will be no breaking changes except for beam search. We also hope to bring better performance for users, as well as a more extensible architecture for developers.

noooop commented 1 month ago

As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

Take support for encode-only models as an example.

Although encode-only models are much simpler than decoder models, the two are very different.

The simplest way to support encode-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

PTAL #8453 #8452


I'm implementing an async scheduler (async single-step scheduling). Beam search and SeqGroupMetadata drive me crazy. It's awesome to hear that beam search and SeqGroupMetadata are being removed.

lixiaolx commented 1 month ago

mark

noooop commented 1 month ago

The Workflow Defined Engine draft pull request is almost complete, and it contains almost 10,000 lines of code.

as @DarkLight1337 said:

Hi, as mentioned earlier, there is basically no way we can merge all these changes at once. You should break up this refactoring into multiple stages.

Therefore, we hope to invite more people to participate, including but not limited to: providing suggestions, joining discussions, aligning this work with vLLM's V2 engine architecture goals, discussing how to break it into stages, and helping review code for future PRs.

Let me briefly introduce the content of this PR, including:

  1. what new models need to be supported,
  2. what new features these new models have, and
  3. how engine Architecture needs to support these features flexibly and efficiently.

What new models need to be supported

These models all come from issues and are also very well known:

These models can be roughly divided into three categories:

What new features these new models have

What the above three categories have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancier way of writing "encode only".

New features:

  1. attention
    • Prefill-only models require a simpler attention implementation: there is no KV cache and no decoding phase (see the sketch after this list).
    • We need to support enabling bidirectional attention, either manually via an enable_bidirectional flag or automatically by reading the HF config.
  2. scheduler
    • Prefill-only models require a simpler scheduler: there is no KV cache and no preemption.
    • With prefill-only models there are no dependencies between tasks, so it is easy to implement async scheduling.
  3. executor
    • In order to support async scheduling, model_input_builder needs to be separated from the runner.
    • The main thread executes scheduling and all CPU processing, and the GPU thread only executes H2D transfers, model execution, and D2H transfers.
    • If async scheduling and async execution are implemented, data parallelism is also easy to implement. Data parallelism is more efficient for small models.
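
As a concrete illustration of the attention point above, here is a minimal sketch (not vLLM's actual code) of how a prefill-only, bidirectional attention call differs from the causal decode path, using PyTorch's scaled_dot_product_attention; the function name and flag are placeholders:

```python
# Hypothetical sketch: prefill-only (encoder-style) attention needs no KV cache
# and no causal mask, unlike the decode path.
import torch
import torch.nn.functional as F


def prefill_only_attention(q, k, v, enable_bidirectional: bool = True):
    """q, k, v: [batch, num_heads, seq_len, head_dim].

    Prefill-only models have no decoding phase, so nothing is ever appended
    to a KV cache; the whole sequence is processed in one attention call.
    """
    return F.scaled_dot_product_attention(
        q, k, v,
        # Bidirectional (encoder) attention: every token attends to every token.
        # A decoder-style prefill would set is_causal=True instead.
        is_causal=not enable_bidirectional,
    )


if __name__ == "__main__":
    q = k = v = torch.randn(1, 8, 16, 64)
    print(prefill_only_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```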

How the engine architecture needs to support these features flexibly and efficiently

If we directly add new features to the existing modules, those modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

The most flexible and efficient way to support prefill-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

I divided the engine into the following modules.

With WDE, there is no need for one module to be compatible with all features. You can use Python's dynamic loading to load different modules at the top level for different models and different needs; a minimal sketch is below.
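
To make the "load the required modules on demand" idea concrete, here is a minimal sketch under my own assumptions (the workflow fields, registry, and module paths are hypothetical, not the ones in the draft PR):

```python
# Hypothetical sketch of a Workflow Defined Engine: each kind of model maps to
# a workflow that names the modules it needs, and the engine imports them
# lazily instead of forcing one scheduler/attention/executor to serve everyone.
import importlib
from dataclasses import dataclass


@dataclass
class Workflow:
    attention: str  # dotted path to the attention backend class
    scheduler: str  # dotted path to the scheduler class
    executor: str   # dotted path to the executor class


# Illustrative registry; the module paths below are made up.
WORKFLOWS = {
    "decode": Workflow(
        attention="engine.attention.PagedAttentionBackend",
        scheduler="engine.scheduler.ContinuousBatchingScheduler",
        executor="engine.executor.GPUExecutor",
    ),
    "prefill_only": Workflow(
        attention="engine.attention.BidirectionalAttentionBackend",
        scheduler="engine.scheduler.PrefillOnlyScheduler",
        executor="engine.executor.AsyncGPUExecutor",
    ),
}


def load_class(dotted_path: str):
    """Resolve 'pkg.module.ClassName' to the class object, importing lazily."""
    module_path, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), class_name)


def build_engine_components(workflow_name: str) -> dict:
    """Import only the modules the chosen workflow actually needs."""
    workflow = WORKFLOWS[workflow_name]
    return {field: load_class(path) for field, path in vars(workflow).items()}
```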

PTAL #8453 #8452

Yang-x-Zhao commented 1 month ago

Given the driver process + SPMD workers design, is there a chance to separate the LLMEngine process and the worker processes onto different nodes (servers)? To be more concrete, the OpenAI API server process and the LLMEngine process would live on a CPU-only node with high-performance CPUs, while the worker processes would live on normal GPU node(s).

I guess this idea is somewhat related to the Ray SPMD worker: https://github.com/vllm-project/vllm/issues/6556, even though I suspect their current implementation does not support a distributed LLMEngine process.
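
Not speaking for the maintainers, but mechanically this would amount to giving the engine and the workers different resource requirements so the cluster scheduler places them on different nodes. A minimal sketch, assuming Ray actors are used for both (the class names and resource numbers are hypothetical):

```python
# Hypothetical sketch: run the LLMEngine as a CPU-only Ray actor and the SPMD
# workers as GPU actors; Ray then places them on different nodes when the
# CPU-heavy node has no GPUs.
import ray

ray.init(address="auto")  # connect to an existing multi-node cluster


@ray.remote(num_cpus=8, num_gpus=0)
class EngineActor:
    """Scheduling, tokenization, and other CPU-heavy work lives here."""

    def step(self, batch):
        ...


@ray.remote(num_gpus=1)
class WorkerActor:
    """Model execution lives on the GPU nodes."""

    def execute_model(self, inputs):
        ...


engine = EngineActor.remote()
workers = [WorkerActor.remote() for _ in range(4)]
```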

Venkat2811 commented 1 month ago

> Lessons we learned from V1:
>
> To achieve high GPU utilization, we should care about everything happening on the CPU.
>
>   • Python is slow.
>
> Scheduling is not cheap.
>
>   • For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
>
> Sampler is expensive.
>
>   • However, "pythonizing" the sampler outputs is expensive.

@simon-mo is the team considering moving away from Python?
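
Just to make the "scheduling is not cheap" point concrete, here is a toy, self-contained timing sketch of a per-step Python loop over a running queue (the request structure and numbers are illustrative, not vLLM's actual scheduler):

```python
# Toy illustration of per-step Python overhead: even trivial bookkeeping over a
# running queue adds up when it executes once per model step.
import time
from collections import deque


class Request:
    def __init__(self, rid: int):
        self.rid = rid
        self.num_blocks = 0


running = deque(Request(i) for i in range(256))  # a plausible running-queue size

steps = 1_000
start = time.perf_counter()
for _ in range(steps):
    for req in running:       # the scheduler touches every running request...
        req.num_blocks += 1   # ...and does some per-request bookkeeping
elapsed = time.perf_counter() - start
print(f"~{elapsed / steps * 1e6:.1f} us of pure-Python loop per step")
```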

yuki252111 commented 1 month ago

mark

wedobetter commented 1 week ago

> Lessons we learned from V1: To achieve high GPU utilization, we should care about everything happening on the CPU.
>
>   • Python is slow.
>
> Scheduling is not cheap.
>
>   • For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
>
> Sampler is expensive.
>
>   • However, "pythonizing" the sampler outputs is expensive.
>
> @simon-mo is the team considering moving away from Python?

It is probably easier to Cythonize the critical bits and wait for Python 3.13 support in torch.

sleepwalker2017 commented 6 days ago

We notice that when input lengths are short, for example less than 200 tokens, the prefill stage leaves the GPU idle too much. If Python code is too slow to keep the GPU busy, can we capture prefills for short sequences into CUDA graphs?
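
Not an answer from the maintainers, but mechanically this would mean padding short prompts into a few fixed-size buckets and capturing the prefill forward pass the same way decode is captured today. A minimal PyTorch capture-and-replay sketch for a single bucket (the stand-in model and bucket size are placeholders):

```python
# Hypothetical sketch: capture the prefill forward pass for one fixed, padded
# input shape in a CUDA graph, then replay it with new inputs copied in place.
import torch

BUCKET_LEN = 256  # short prompts are padded up to this length
model = torch.nn.Linear(4096, 4096).cuda().eval()  # stand-in for the real model

static_input = torch.zeros(1, BUCKET_LEN, 4096, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends for CUDA graphs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)


def run_short_prefill(padded_input: torch.Tensor) -> torch.Tensor:
    static_input.copy_(padded_input)  # write the padded prompt into the captured buffer
    graph.replay()                    # replay the captured kernels with no per-op Python overhead
    return static_output.clone()
```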