simon-mo opened this issue 1 month ago
I want to highlight that the re-arch will only affect vLLM developers who need to change vLLM's code, and in a positive way that makes their lives easier. For vLLM users who use vLLM directly, there will be no breaking changes except for beam search. We also hope to bring better performance for users, as well as a more extensible architecture for developers.
As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
Take support for encoder-only models as an example.
Although encoder-only models are much simpler than decoder models, they are very different from them.
The simplest way to support encoder-only models is to implement different modules for models of different architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
PTAL #8453 #8452
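To make this concrete, here is a hypothetical sketch (not the PR's actual code) of what a workflow definition could look like: a workflow simply names which scheduler, executor, and input/output processor classes a given model family uses, so a prefill-only model can plug in much simpler implementations. All dotted class paths below are illustrative, not real vLLM modules.

```python
# Hypothetical illustration of a "workflow" that names the modules a model
# family needs; the dotted paths are made up, not real vLLM modules.
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    scheduler: str
    executor: str
    input_processor: str
    output_processor: str


# A decode (generation) model needs KV-cache-aware scheduling and a sampler.
DECODE_WORKFLOW = Workflow(
    scheduler="wde.decode.Scheduler",
    executor="wde.decode.Executor",
    input_processor="wde.decode.InputProcessor",
    output_processor="wde.decode.DetokenizingOutputProcessor",
)

# A prefill-only (encoder-only) model can swap in much simpler modules:
# no decoding loop, no sampler, no growing KV cache.
PREFILL_ONLY_WORKFLOW = Workflow(
    scheduler="wde.prefill_only.Scheduler",
    executor="wde.prefill_only.Executor",
    input_processor="wde.prefill_only.InputProcessor",
    output_processor="wde.prefill_only.PoolingOutputProcessor",
)
```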
I'm implementing an async scheduler (async single-step scheduling). Beam search and SeqGroupMetadata drive me crazy. Awesome to hear that beam search and SeqGroupMetadata are being removed.
mark
The Workflow Defined Engine draft pull request is almost complete, and it contains almost 10,000 lines of code.
as @DarkLight1337 said:
Hi, as mentioned earlier, there is basically no way we can merge all these changes at once. You should break up this refactoring into multiple stages.
Therefore, we hope to invite more people to participate, including but not limited to: providing suggestions, joining the discussion, aligning with vLLM's V2 engine architecture goals, discussing how to break the work into stages, and helping review future PRs.
Let me briefly introduce the contents of this PR, including:
These models all come from issues and are also very well known:
These models fall roughly into three categories:
What the above three categories have in common is that they only have a prefill stage. To make the terminology more precise, "prefill only" is used below.
You can think of "prefill only" as a fancier way of saying "encode only".
New features:
If we directly add new features to the existing modules, these modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
The most flexible and efficient way to support prefill-only models is to implement different modules for models of different architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
I divided the Engine into the following modules.
With WDE, there is no need for one module to be compatible with every feature. You can use Python's dynamic loading to bring in different modules at the highest level, for different models and different needs.
PTAL #8453 #8452
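As a rough illustration of the dynamic-loading idea (again, not the actual PR's code), the engine could resolve a workflow's dotted class paths at startup with importlib, so only the modules the chosen model needs are ever imported. The workflow object and paths are the hypothetical ones sketched earlier.

```python
# Illustrative only: load the classes a workflow names, using Python's
# standard dynamic import machinery. Nothing else gets imported.
from importlib import import_module


def resolve(dotted_path: str):
    """Turn 'package.module.ClassName' into the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), class_name)


def build_engine_components(workflow) -> dict:
    """Look up (but don't yet instantiate) the classes the workflow asks for."""
    return {
        "scheduler": resolve(workflow.scheduler),
        "executor": resolve(workflow.executor),
        "input_processor": resolve(workflow.input_processor),
        "output_processor": resolve(workflow.output_processor),
    }
```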
Given the driver process + SPMD workers design, is there a chance to separate the LLMEngine process and the worker processes onto different nodes (servers)? To be more concrete, the OpenAPI server process and the LLMEngine process should live on a node with only a high-performance CPU, while the worker processes should live on normal GPU node(s).
I guess this idea is somewhat related to the Ray SPMD worker: https://github.com/vllm-project/vllm/issues/6556, even though I suspect their current implementation does not support a distributed LLMEngine process.
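For what it's worth, here is a rough sketch of the idea using plain Ray actors (this is not how vLLM wires things up today): the engine actor is pinned to a CPU-only node via a custom resource, while each worker actor requests a GPU, so Ray places it on a GPU node. The actor names and the "cpu_node" resource are assumptions; the custom resource would have to be declared when starting the CPU-only node.

```python
# Sketch only: engine on a CPU-only node, SPMD-style workers on GPU nodes.
import ray

ray.init(address="auto")


# Assumes the CPU-only node was started with --resources='{"cpu_node": 1}'.
@ray.remote(num_cpus=8, resources={"cpu_node": 0.1})
class EngineActor:
    def schedule(self, requests):
        # Tokenization, scheduling, and API-facing logic stay on the CPU node.
        return [f"batch for {r}" for r in requests]


@ray.remote(num_gpus=1)
class WorkerActor:
    def execute_model(self, batch):
        # The model forward pass runs on a GPU node.
        return f"output of {batch}"


engine = EngineActor.remote()
workers = [WorkerActor.remote() for _ in range(2)]

batches = ray.get(engine.schedule.remote(["req-0", "req-1"]))
outputs = ray.get([w.execute_model.remote(b) for w, b in zip(workers, batches)])
print(outputs)
```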
Lessons we learned from V1:
To achieve high GPU utilization, we should care about everything happening on the CPU.
- Python is slow.
- Scheduling is not cheap.
  - For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python. (A toy sketch of this loop follows below.)
- Sampler is expensive.
  - However, "pythonizing" the sampler outputs is expensive.
@simon-mo is the team considering moving away from Python?
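To make the scheduling point concrete, here is a toy sketch (not vLLM's actual scheduler) of the kind of per-step Python loop described above: every step walks the whole running queue and does per-request bookkeeping such as block allocation, all in the interpreter, between GPU steps.

```python
# Toy illustration of per-step scheduler overhead; names and logic are made up.
from collections import deque


class ToyScheduler:
    def __init__(self, requests, block_size: int = 16):
        # Each entry: [request_id, tokens_computed_so_far]
        self.running = deque([rid, 0] for rid in requests)
        self.block_size = block_size
        self.block_tables = {rid: [] for rid in requests}

    def step(self):
        scheduled = []
        # O(len(self.running)) pure-Python work on every single step.
        for entry in self.running:
            rid, num_tokens = entry
            if num_tokens % self.block_size == 0:
                # Stand-in for allocating a new KV-cache block for this request.
                self.block_tables[rid].append(object())
            entry[1] = num_tokens + 1  # one more token computed this step
            scheduled.append(rid)
        return scheduled


sched = ToyScheduler([f"req-{i}" for i in range(256)])
for _ in range(8):
    sched.step()
```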
mark
Probably easier to cythonize critical bits and wait for PY3.13 support in torch
We notice that when input lengths are short, for example less than 200 tokens, the prefill stage leaves the GPU idle for too long. If the Python code is too slow to keep the GPU busy, can we capture prefills for short sequences in CUDA graphs?
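Here is a minimal sketch of what that could look like with PyTorch's CUDA graph API, assuming short prompts are padded into a fixed-size bucket (e.g. 200 tokens) and using a stand-in module in place of a real model. It is not vLLM code and glosses over attention metadata, which is the hard part.

```python
# Sketch: capture a fixed-shape short prefill once, then replay it per request.
import torch

BUCKET_LEN = 200
model = torch.nn.Embedding(32000, 1024).cuda()  # stand-in for a real model

static_input = torch.zeros(1, BUCKET_LEN, dtype=torch.long, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)


def run_short_prefill(token_ids: torch.Tensor) -> torch.Tensor:
    """Copy a (padded) short prompt into the static buffer and replay the graph."""
    static_input.zero_()
    static_input[0, : token_ids.numel()].copy_(token_ids)
    graph.replay()
    return static_output
```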
This issue describes the high-level directions for creating "LLM Engine V1". We want the design to be as transparent as possible, and we created this issue to track progress and solicit feedback.
Goal:
Non-goals (the following are important but orthogonal):
The scope is exclusively the scheduler, memory manager, and distributed architecture. We will not touch APIs, models, kernels, or most parts of the model runner.
Highlights of the new design:
Lessons we learned from V1:
To achieve high GPU utilization, we should care about everything happening on the CPU.
- Python is slow.
- Scheduling is not cheap.
  - For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
- Sampler is expensive.
  - However, "pythonizing" the sampler outputs is expensive.

Timeline wise, we plan to execute the changes incrementally. Over time, we will add PRs and issues related to the new architecture here.
The design is led by the vLLM maintainers @WoosukKwon @zhuohan123 @youkaichao @simon-mo @LiuXiaoxuanPKU @comaniac @alexm-neuralmagic @njhill @robertgshaw2-neuralmagic @rkooo567 and many others!