[RFC]: Support encode only models by Workflow Defined Engine

noooop commented 2 months ago

Motivation.

As vllm supports more and more models and functions, they require different attention, scheduler, executor, and input output processor. . These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility. ultimately leading to suboptimal results

Take support for encode only models as an example

Although the encode only models is much simpler than the decode model, they are very different.

The simplest way to support the encode only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

Terminology.

The scope of discussion is slightly larger than encode only models, and is roughly divided into three categories：

Encode only models. (Bidirectional Transformers, causal=False), Often fine-tuned as retriever and reranker etc.
Decode only models. (masked multi-head attention, causal=True). There are two interesting uses:
- Output last hidden states as a feature extractor
- Decode only retriever （I don't know of a better name），E.g. e5-mistral-7b （The only Embed model currently supported by vllm)
- Whether it has been fine-tuned or not, there is almost no difference in the code.
- Enable bidirectional. LLM2Vec propose a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
  - Therefore, we need to support enable_bidirectional flag manually or read hf config automatically, enable bidirectional.

What the above three usages have in common is that there is only the prefill stage. In order to make the terminology more precise, prefill only is used below.

You can think of prefill only as encode only fancy writing.

add more: Natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill only models mentioned in this discussion are NLU models. NLU is just like the name and does not generate new token.

Proposed Change.

SUMMARY:

Prefill only models requires simpler attention implementations (prefill only, no kvcache...)
Prefill only models requires simpler scheduler. (no kvcache, no preemption...)
In order to support asynchronous scheduling, model_input_builder needs to be separated from the runner. The main thread executes scheduling and all CPU processing, and the gpu thread only executes h2d, execution model, d2h
With wde, there is no need for one module to be compatible with all functions. You can always use the workflow to load new modules at the highest level to support new functions.

Feedback Period.

No response

CC List.

No response

Any Other Things.

PTAL #8452

Supported models:

xlm_roberta （#6260）
bge-m3 （#3187 #5737 #6498 #7969）
bge-reranker-v2-m3 (#8022)
bert (#5179 #5447 #7496)
bge v1.5 family which rely on bert ( #7506 #5502)
Snowflake Arctic Embed (Family) (#7792) (The architecture is the same as bge v1.5 family, lucky)
e5-mistral-7b （The only Embed model currently supported by vllm)
output last hidden states (#8483 #853 #7915 #6947)
gte-Qwen2 （#6282 #5827 #5600 #6015 #5611 #7389）
- Because gte-Qwen2 and Qwen2 use the same architecture name，Qwen2ForCausalLM. The code looks very sad!
- gte-Qwen2 family may have multiple different architectures. The code looks very sad!
  - gte-Qwen2-1.5B-instruct, Official sample code sentence_transformers usage and Transformers usage does not use enable bidirectional， discussions. I'm not sure if this is a bug
  - gte-Qwen2-7B-instruct use enable bidirectional

Features supported and tested:

WDE core
Attention Backend for prefill only models
- Flash Attention Backend
- Torch SDPA Backend
- XFormers Backend
- FlashInfer Backend (Because prefill only models do not involve kv cache, When using Flashinfer backend in prefill only models, you are actually using FLASH ATTN backend
- Torch naive backend (as a control group
Asynchronous scheduling for prefill only models (simple_execute_loop and double_buffer_execute_loop)
output last hidden states
enable bidirectional
data parallelism Not fully tested

WIP:

Limit GPU memory usage by gpu_memory_utilization to avoid oom

Functions that have not yet, but are relatively important

Integrate wde into vllm entrypoints
more Attention Backend support （I only have cuda device
- ROCM_FLASH
- OPENVINO
- PALLAS
- IPEX
Support distributed executer
For small models, data parallelism is more efficient
- tensor parallelism (tensor parallelism is coupled with other parts, can we decouple it?
- pipeline parallelism
Limit GPU memory usage by gpu_memory_utilization to avoid oom
Support quantization models
Support lora
- LLM2Vec (#6584)
maybe more

Anyway, I hope vllm can support prefill only models as soon as possible

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

noooop commented 2 months ago

benchmarks:

for bge-m3 for xlm-roberta

test on 4090 * 1

Throughput on the abscissa, delay on the ordinate, lower right is better

xlm-roberta-base

xlm-roberta-large

bge-m3

The speed of wde is significantly faster than hf under various batch sizes.

profiler

simple_execute_loop

double_buffer_execute_loop

Using double buffer, IO and calculation can be parallelized, which is slightly faster, but the gpu memory is almost doubled.

noooop commented 2 months ago

benchmarks different attention implementations:

FlashInfer Backend (Because encode only models do not involve kv cache, When using Flashinfer backend in encode only models, you are actually using FLASH ATTN backend

code

test on 4090 * 1

Throughput on the abscissa, delay on the ordinate, lower right is better

fp32