
vLLM's V1 Engine Architecture #8779

Open simon-mo opened 1 month ago

simon-mo commented 1 month ago

This issue describes the high-level directions for creating the LLM Engine V1. We want the design to be as transparent as possible, and we created this issue to track progress and solicit feedback.

Goal:

Non-goals (the following are important but orthogonal):

The scope is exclusively the scheduler, memory manager, and distributed architecture. We will not touch APIs, models, kernels, or most parts of the model runner.

Highlights of the new design:

Lessons we learned from V1:

Timeline-wise, we plan to execute the changes incrementally. Over time we will add PRs and issues related to the new architecture here.

The design is led by the vLLM maintainers @WoosukKwon @zhuohan123 @youkaichao @simon-mo @LiuXiaoxuanPKU @comaniac @alexm-neuralmagic @njhill @robertgshaw2-neuralmagic @rkooo567 and many others!

youkaichao commented 1 month ago

I want to highlight that the re-architecture will only affect vLLM developers who need to change vLLM's code, and in a positive way that makes their lives easier. For users who use vLLM directly, there will be no breaking changes except for beam search. We also hope to bring better performance for users, as well as a more extensible architecture for developers.

noooop commented 1 month ago

As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

Take support for encode-only models as an example.

Although encode-only models are much simpler than decoder models, the two are very different.

The simplest way to support encode-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

PTAL #8453 #8452


I'm implementing an async scheduler (async single-step scheduling). Beam search and SeqGroupMetadata drive me crazy. It's awesome to hear that beam search and SeqGroupMetadata are being removed.

lixiaolx commented 1 month ago

mark

noooop commented 1 month ago

The Workflow Defined Engine draft pull request is almost complete, and it contains almost 10,000 lines of code.

as @DarkLight1337 said:

Hi, as mentioned earlier, there is basically no way we can merge all these changes at once. You should break up this refactoring into multiple stages.

Therefore, we hope to invite more people to participate, including but not limited to: providing suggestions, joining discussions, aligning this work with vLLM's V2 engine architecture goals, discussing how to break it into stages, and helping review code for future PRs.

Let me briefly introduce the content of this PR, including:

  1. what new models need to be supported,
  2. what new features these new models have, and
  3. how engine Architecture needs to support these features flexibly and efficiently.

What new models need to be supported

These models all come from issues and are also very well known:

These models can be roughly divided into three categories:

What new features these new models have

What the above three categories have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancier way of writing "encode only".

New features:

  1. attention
    • Prefill-only models require a simpler attention implementation: there is no KV cache and no decoding phase (see the sketch after this list).
    • We need to support enabling bidirectional attention, either manually via an enable_bidirectional flag or automatically by reading the HF config.
  2. scheduler
    • Prefill-only models require a simpler scheduler: there is no KV cache and no preemption.
    • With prefill-only models there are no dependencies between tasks, so it is easy to implement async scheduling.
  3. executor
    • In order to support async scheduling, model_input_builder needs to be separated from the runner.
    • The main thread executes scheduling and all CPU processing, and the GPU thread only executes H2D transfers, model execution, and D2H transfers.
    • If async scheduling and async execution are implemented, data parallelism is also easy to implement. Data parallelism is more efficient for small models.
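
As a concrete illustration of the attention point above, here is a minimal sketch (not vLLM's actual code) of how a prefill-only, bidirectional attention call differs from the causal decode path, using PyTorch's scaled_dot_product_attention; the function name and flag are placeholders:

```python
# Hypothetical sketch: prefill-only (encoder-style) attention needs no KV cache
# and no causal mask, unlike the decode path.
import torch
import torch.nn.functional as F


def prefill_only_attention(q, k, v, enable_bidirectional: bool = True):
    """q, k, v: [batch, num_heads, seq_len, head_dim].

    Prefill-only models have no decoding phase, so nothing is ever appended
    to a KV cache; the whole sequence is processed in one attention call.
    """
    return F.scaled_dot_product_attention(
        q, k, v,
        # Bidirectional (encoder) attention: every token attends to every token.
        # A decoder-style prefill would set is_causal=True instead.
        is_causal=not enable_bidirectional,
    )


if __name__ == "__main__":
    q = k = v = torch.randn(1, 8, 16, 64)
    print(prefill_only_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```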

How the engine architecture needs to support these features flexibly and efficiently

If we directly add new features to the existing modules, those modules become increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

The most flexible and efficient way to support prefill-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

I divided the engine into the following modules.

With WDE, there is no need for one module to be compatible with all features. You can use Python's dynamic loading to load different modules at the top level for different models and different needs; a minimal sketch is below.
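
To make the "load the required modules on demand" idea concrete, here is a minimal sketch under my own assumptions (the workflow fields, registry, and module paths are hypothetical, not the ones in the draft PR):

```python
# Hypothetical sketch of a Workflow Defined Engine: each kind of model maps to
# a workflow that names the modules it needs, and the engine imports them
# lazily instead of forcing one scheduler/attention/executor to serve everyone.
import importlib
from dataclasses import dataclass


@dataclass
class Workflow:
    attention: str  # dotted path to the attention backend class
    scheduler: str  # dotted path to the scheduler class
    executor: str   # dotted path to the executor class


# Illustrative registry; the module paths below are made up.
WORKFLOWS = {
    "decode": Workflow(
        attention="engine.attention.PagedAttentionBackend",
        scheduler="engine.scheduler.ContinuousBatchingScheduler",
        executor="engine.executor.GPUExecutor",
    ),
    "prefill_only": Workflow(
        attention="engine.attention.BidirectionalAttentionBackend",
        scheduler="engine.scheduler.PrefillOnlyScheduler",
        executor="engine.executor.AsyncGPUExecutor",
    ),
}


def load_class(dotted_path: str):
    """Resolve 'pkg.module.ClassName' to the class object, importing lazily."""
    module_path, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), class_name)


def build_engine_components(workflow_name: str) -> dict:
    """Import only the modules the chosen workflow actually needs."""
    workflow = WORKFLOWS[workflow_name]
    return {field: load_class(path) for field, path in vars(workflow).items()}
```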

PTAL #8453 #8452

Yang-x-Zhao commented 1 month ago

Given the driver process + SPMD workers design, is there a chance to separate the LLMEngine process and the worker processes onto different nodes (servers)? To be more concrete, the OpenAI API server process and the LLMEngine process would live on a CPU-only node with high-performance CPUs, while the worker processes would live on normal GPU node(s).

I guess this idea is somewhat related to the Ray SPMD worker: https://github.com/vllm-project/vllm/issues/6556, even though I suspect their current implementation does not support a distributed LLMEngine process.
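
Not speaking for the maintainers, but mechanically this would amount to giving the engine and the workers different resource requirements so the cluster scheduler places them on different nodes. A minimal sketch, assuming Ray actors are used for both (the class names and resource numbers are hypothetical):

```python
# Hypothetical sketch: run the LLMEngine as a CPU-only Ray actor and the SPMD
# workers as GPU actors; Ray then places them on different nodes when the
# CPU-heavy node has no GPUs.
import ray

ray.init(address="auto")  # connect to an existing multi-node cluster


@ray.remote(num_cpus=8, num_gpus=0)
class EngineActor:
    """Scheduling, tokenization, and other CPU-heavy work lives here."""

    def step(self, batch):
        ...


@ray.remote(num_gpus=1)
class WorkerActor:
    """Model execution lives on the GPU nodes."""

    def execute_model(self, inputs):
        ...


engine = EngineActor.remote()
workers = [WorkerActor.remote() for _ in range(4)]
```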

Venkat2811 commented 1 month ago

> Lessons we learned from V1:
>
> To achieve high GPU utilization, we should care about everything happening on the CPU.
>
>   • Python is slow.
>
> Scheduling is not cheap.
>
>   • For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
>
> Sampler is expensive.
>
>   • However, "pythonizing" the sampler outputs is expensive.

@simon-mo is the team considering moving away from Python?
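
Just to make the "scheduling is not cheap" point concrete, here is a toy, self-contained timing sketch of a per-step Python loop over a running queue (the request structure and numbers are illustrative, not vLLM's actual scheduler):

```python
# Toy illustration of per-step Python overhead: even trivial bookkeeping over a
# running queue adds up when it executes once per model step.
import time
from collections import deque


class Request:
    def __init__(self, rid: int):
        self.rid = rid
        self.num_blocks = 0


running = deque(Request(i) for i in range(256))  # a plausible running-queue size

steps = 1_000
start = time.perf_counter()
for _ in range(steps):
    for req in running:       # the scheduler touches every running request...
        req.num_blocks += 1   # ...and does some per-request bookkeeping
elapsed = time.perf_counter() - start
print(f"~{elapsed / steps * 1e6:.1f} us of pure-Python loop per step")
```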

yuki252111 commented 1 month ago

mark

wedobetter commented 1 week ago

> Lessons we learned from V1: To achieve high GPU utilization, we should care about everything happening on the CPU.
>
>   • Python is slow.
>
> Scheduling is not cheap.
>
>   • For every step, the vLLM scheduler goes over the whole self.running queue and performs some operations for each request (e.g., allocating a new block). And this is written in Python.
>
> Sampler is expensive.
>
>   • However, "pythonizing" the sampler outputs is expensive.
>
> @simon-mo is the team considering moving away from Python?

It is probably easier to Cythonize the critical bits and wait for Python 3.13 support in torch.

sleepwalker2017 commented 6 days ago

We notice that when input lengths are short, for example less than 200 tokens, the prefill stage leaves the GPU idle too much. If Python code is too slow to keep the GPU busy, can we capture prefills for short sequences into CUDA graphs?
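
Not an answer from the maintainers, but mechanically this would mean padding short prompts into a few fixed-size buckets and capturing the prefill forward pass the same way decode is captured today. A minimal PyTorch capture-and-replay sketch for a single bucket (the stand-in model and bucket size are placeholders):

```python
# Hypothetical sketch: capture the prefill forward pass for one fixed, padded
# input shape in a CUDA graph, then replay it with new inputs copied in place.
import torch

BUCKET_LEN = 256  # short prompts are padded up to this length
model = torch.nn.Linear(4096, 4096).cuda().eval()  # stand-in for the real model

static_input = torch.zeros(1, BUCKET_LEN, 4096, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends for CUDA graphs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)


def run_short_prefill(padded_input: torch.Tensor) -> torch.Tensor:
    static_input.copy_(padded_input)  # write the padded prompt into the captured buffer
    graph.replay()                    # replay the captured kernels with no per-op Python overhead
    return static_output.clone()
```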