vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: Support encode only models by Workflow Defined Engine #8453

Open noooop opened 2 months ago

noooop commented 2 months ago

Motivation.

As vLLM supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

Take support for encode only models as an example

Although encode-only models are much simpler than decoder models, the two are very different.

The simplest way to support encode-only models is to implement separate modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.
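To make the "load the required modules on demand" idea concrete, here is a minimal sketch of what a workflow definition could look like (the class paths and module names below are hypothetical illustrations, not the PR's actual code):

```python
# Hypothetical sketch (not the actual vLLM/WDE API): a Workflow declares which
# module implementations the engine should load for a given model family, so
# prefill-only models can swap in simpler components on demand.
from dataclasses import dataclass
from importlib import import_module


@dataclass
class Workflow:
    # Fully qualified class names, resolved lazily at engine start-up.
    attn_backend: str
    scheduler: str
    executor: str
    input_processor: str

    def load(self, dotted_path: str):
        module_name, _, class_name = dotted_path.rpartition(".")
        return getattr(import_module(module_name), class_name)


# A decode workflow and a prefill-only workflow differ only in which classes
# they point to; no single module has to be compatible with everything.
PREFILL_ONLY_WORKFLOW = Workflow(
    attn_backend="wde.attention.PrefillOnlyFlashAttention",  # no kv cache
    scheduler="wde.scheduler.PrefillOnlyScheduler",          # no preemption
    executor="wde.executor.GPUExecutor",
    input_processor="wde.processor.EncodeInputProcessor",
)
```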

Terminology.

The scope of this discussion is slightly larger than encode-only models; it is roughly divided into three categories:

What the above three usages have in common is that they involve only the prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancier way of writing "encode only".

To add more context: natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill-only models discussed here are NLU models; as the name suggests, NLU models do not generate new tokens.

Proposed Change.

SUMMARY:

  1. Prefill-only models require simpler attention implementations (prefill only, no kv cache...)
  2. Prefill-only models require a simpler scheduler (no kv cache, no preemption...)
  3. To support asynchronous scheduling, the model_input_builder needs to be separated from the runner. The main thread executes scheduling and all CPU processing, and the GPU thread only executes h2d transfer, model execution, and d2h transfer (see the sketch after this list).
  4. With WDE, there is no need for one module to be compatible with all functions. You can always use a workflow to load new modules at the highest level to support new features.
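A minimal sketch of the main-thread / GPU-thread split described in item 3 (all names are hypothetical and only illustrate the idea, not the PR's actual implementation):

```python
# The main thread runs the scheduler and all CPU-side input building, while a
# dedicated GPU thread only performs h2d copy, model execution, and d2h copy.
import queue
import threading


def main_loop(scheduler, model_input_builder, gpu_queue):
    while True:
        scheduler_output = scheduler.schedule()          # CPU: pick requests
        if scheduler_output is None:
            gpu_queue.put(None)                          # signal shutdown
            break
        model_input = model_input_builder.build(scheduler_output)  # CPU: host tensors
        gpu_queue.put(model_input)


def gpu_loop(runner, gpu_queue, result_queue):
    while True:
        model_input = gpu_queue.get()
        if model_input is None:
            break
        device_input = runner.to_device(model_input)         # h2d
        device_output = runner.execute_model(device_input)   # GPU execution
        result_queue.put(runner.to_host(device_output))      # d2h


def run(scheduler, model_input_builder, runner):
    gpu_queue, result_queue = queue.Queue(maxsize=2), queue.Queue()
    t = threading.Thread(target=gpu_loop, args=(runner, gpu_queue, result_queue))
    t.start()
    main_loop(scheduler, model_input_builder, gpu_queue)
    t.join()
    return list(result_queue.queue)
```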

Feedback Period.

No response

CC List.

No response

Any Other Things.

PTAL #8452

Supported models:

Features supported and tested:

WIP:

Functions that are not yet supported, but are relatively important:

Anyway, I hope vLLM can support prefill-only models as soon as possible.


noooop commented 2 months ago

benchmarks:

For bge-m3 and xlm-roberta models.

Tested on a single RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

[Benchmark plots: xlm-roberta-base, xlm-roberta-large, bge-m3]

WDE is significantly faster than HF (Hugging Face Transformers) across various batch sizes.
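For reference, a hedged sketch of the kind of measurement behind these plots (an assumed methodology, not the author's actual benchmark script; `encode_batch` is a placeholder for either the WDE or the HF forward pass):

```python
# Sweep batch size, recording requests/second (throughput) and per-batch latency,
# which gives one point per batch size on a throughput-vs-latency curve.
import time


def benchmark(encode_batch, requests, batch_sizes=(1, 2, 4, 8, 16, 32, 64)):
    results = []
    for bs in batch_sizes:
        latencies = []
        start = time.perf_counter()
        for i in range(0, len(requests), bs):
            t0 = time.perf_counter()
            encode_batch(requests[i:i + bs])   # placeholder forward pass
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
        results.append({
            "batch_size": bs,
            "throughput_req_per_s": len(requests) / elapsed,
            "mean_latency_s": sum(latencies) / len(latencies),
        })
    return results
```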

Profiler:

[Profiler traces: simple_execute_loop, double_buffer_execute_loop]

With double buffering, IO and computation can be overlapped, which is slightly faster, but GPU memory usage is almost doubled.
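A rough sketch of the double-buffer idea, assuming PyTorch CUDA streams (hypothetical names, not the PR's actual double_buffer_execute_loop): the h2d copy of batch N+1 overlaps with the compute of batch N, which is also why input memory roughly doubles.

```python
# Two input slots alternate between a copy stream and a compute stream, so the
# host-to-device transfer of the next batch overlaps with the current compute.
import torch


def double_buffer_execute_loop(runner, batches):
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    events = [torch.cuda.Event(), torch.cuda.Event()]
    device_inputs = [None, None]
    outputs = []

    for i, batch in enumerate(batches):
        slot = i % 2
        with torch.cuda.stream(copy_stream):
            device_inputs[slot] = runner.to_device(batch, non_blocking=True)  # h2d
            events[slot].record(copy_stream)
        with torch.cuda.stream(compute_stream):
            compute_stream.wait_event(events[slot])   # wait for this slot's copy
            outputs.append(runner.execute_model(device_inputs[slot]))
    torch.cuda.synchronize()
    return outputs
```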

noooop commented 2 months ago

Benchmarks of different attention implementations:

FlashInfer backend: because encode-only models do not involve the kv cache, when using the FlashInfer backend with encode-only models you are actually using the FLASH_ATTN backend.
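Illustratively, the backend choice for encode-only models effectively collapses as follows (a hypothetical helper sketching the statement above, not vLLM's actual backend-selection code):

```python
# When a model never allocates a kv cache, a "FlashInfer" choice resolves to
# the flash-attention prefill path, so the two backends benchmark identically
# for encode-only models.
def resolve_attn_backend(requested: str, needs_kv_cache: bool) -> str:
    if not needs_kv_cache and requested == "FLASHINFER":
        return "FLASH_ATTN"
    return requested
```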

code

Tested on a single RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

[Benchmark plots: fp32, fp16, bf16]

The Flash Attention backend is the fastest, no surprise at all.

[Plot: FLASH_ATTN backend, dtype comparison]

When using FLASH_ATTN, bf16 and fp16 are almost the same speed.

noooop commented 1 month ago

@DarkLight1337

I am doing the final code cleanup. Please sort out the issues related to prefill-only models; I will address them as much as possible.

And there are almost 10,000 lines of code. This PR is not going to support multimodal LLMs.

liweiqing1997 commented 1 month ago

This is great work, and I really need this functionality.