benchmarks:
Tested on a single RTX 4090.
Throughput is on the x-axis and latency on the y-axis; lower right is better.
wde is significantly faster than hf across all tested batch sizes.
With double buffering, I/O and computation can overlap, which is slightly faster, but GPU memory usage is almost doubled.
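A rough sketch of the double-buffer idea (illustrative only, not the actual wde code), assuming batches of pinned CPU tensors: the host-to-device copy of the next batch is issued on a separate CUDA stream while the current batch is computing, at the cost of keeping two sets of input buffers on the GPU.

```python
# Minimal double-buffering sketch (illustrative only, not the wde implementation).
# Two CUDA streams let the H2D copy of batch i+1 overlap the forward pass of
# batch i; the price is holding two input buffers on the GPU at once.
import torch

def run_double_buffered(model, batches, device="cuda"):
    copy_stream = torch.cuda.Stream(device=device)
    compute_stream = torch.cuda.Stream(device=device)
    outputs = []

    # Prefetch the first batch on the copy stream.
    with torch.cuda.stream(copy_stream):
        next_gpu = {k: v.to(device, non_blocking=True) for k, v in batches[0].items()}

    for i in range(len(batches)):
        cur_gpu = next_gpu
        # The forward pass of batch i must wait for batch i's copy ...
        compute_stream.wait_stream(copy_stream)

        # ... but the copy of batch i+1 can be issued now and overlap the compute.
        if i + 1 < len(batches):
            with torch.cuda.stream(copy_stream):
                next_gpu = {k: v.to(device, non_blocking=True)
                            for k, v in batches[i + 1].items()}

        with torch.cuda.stream(compute_stream), torch.inference_mode():
            outputs.append(model(**cur_gpu))

    torch.cuda.synchronize(device)
    return outputs
```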
benchmarks of different attention implementations:
FlashInfer backend: because encode-only models do not involve a KV cache, selecting the FlashInfer backend for an encode-only model actually uses the FLASH_ATTN backend.
Tested on a single RTX 4090.
Throughput is on the x-axis and latency on the y-axis; lower right is better.
The Flash Attention backend is the fastest, no surprise at all.
When using FLASH_ATTN, bf16 and fp16 run at almost the same speed.
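The FlashInfer fallback above can be thought of as a simple capability check at backend-selection time. Below is a hypothetical sketch of that logic; the names are made up for illustration and are not vLLM's real backend-selection code.

```python
# Hypothetical sketch of the fallback described above; these names are
# illustrative placeholders and do not match vLLM's actual code.
from enum import Enum, auto

class AttentionBackend(Enum):
    FLASH_ATTN = auto()
    FLASHINFER = auto()
    TORCH_SDPA = auto()

def select_attn_backend(requested: AttentionBackend, uses_kv_cache: bool) -> AttentionBackend:
    # FlashInfer's advantage is paged KV-cache attention; a prefill-only
    # (encode-only) model never builds a KV cache, so a FLASHINFER request
    # effectively degenerates to the plain flash-attention prefill path.
    if requested is AttentionBackend.FLASHINFER and not uses_kv_cache:
        return AttentionBackend.FLASH_ATTN
    return requested
```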
@DarkLight1337
I am doing the final code cleanup. Please sort out the issues related to prefill-only models, and I will resolve as many of them as possible.
There are almost 10,000 lines of code, and this PR is not going to support multimodal LLMs.
This is valuable work, and I really need help with it.
Motivation.
As vLLM supports more and more models and features, they require different attention backends, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
Take support for encode-only models as an example.
Although encode-only models are much simpler than decoder models, the two are very different.
The simplest way to support encode-only models is to implement different modules for different model architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
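A minimal sketch of what "workflow defined" could mean in practice, assuming a hypothetical `Workflow` class that names the module classes an architecture needs and resolves them lazily (all class paths below are illustrative placeholders, not the actual wde code):

```python
# Illustrative sketch of a "Workflow Defined Engine": each model architecture
# declares which scheduler/executor/processor classes it needs as dotted paths,
# and the engine imports only those modules on demand.
# All class paths here are hypothetical placeholders.
import importlib
from dataclasses import dataclass

@dataclass
class Workflow:
    scheduler: str
    executor: str
    input_processor: str
    output_processor: str
    attn_backend: str

    def load(self, dotted_path: str):
        module_name, _, class_name = dotted_path.rpartition(".")
        return getattr(importlib.import_module(module_name), class_name)

PREFILL_ONLY_WORKFLOW = Workflow(
    scheduler="wde.prefill_only.Scheduler",
    executor="wde.prefill_only.Executor",
    input_processor="wde.prefill_only.InputProcessor",
    output_processor="wde.prefill_only.OutputProcessor",
    attn_backend="wde.prefill_only.FlashAttnBackend",
)

def build_engine(workflow: Workflow):
    # Only the classes this workflow names ever get imported.
    scheduler_cls = workflow.load(workflow.scheduler)
    executor_cls = workflow.load(workflow.executor)
    return scheduler_cls, executor_cls
```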
Terminology.
The scope of this discussion is slightly larger than encode-only models and roughly covers three categories:
What these three usages have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.
You can think of "prefill only" as a fancier way of writing "encode only".
To add more context: natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill-only models mentioned in this discussion are NLU models; as the name suggests, NLU does not generate new tokens.
Proposed Change.
SUMMARY:
Feedback Period.
No response
CC List.
No response
Any Other Things.
PTAL #8452
Supported models:
Features supported and tested:
WIP:
Functions that are not yet implemented but are relatively important:
For small models, data parallelism is more efficient (a rough sketch follows below).
Anyway, I hope vLLM can support prefill-only models as soon as possible.
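Regarding the data-parallelism point above, a rough, hypothetical sketch: run one full engine replica per GPU and shard requests across replicas through a shared queue (`make_engine` and `engine.encode` are placeholders, not real vLLM/wde APIs).

```python
# Rough sketch of data parallelism for small prefill-only models: one engine
# replica per GPU, requests pulled from a shared queue by whichever worker is
# free.  `make_engine` / `engine.encode` are hypothetical placeholders.
import torch.multiprocessing as mp

def worker(rank, make_engine, request_queue, result_queue):
    engine = make_engine(device=f"cuda:{rank}")   # one full replica per GPU
    while True:
        req = request_queue.get()
        if req is None:                           # shutdown sentinel
            break
        result_queue.put((req["id"], engine.encode(req["texts"])))

def serve(num_gpus, make_engine, requests):
    ctx = mp.get_context("spawn")
    req_q, res_q = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, make_engine, req_q, res_q))
             for r in range(num_gpus)]
    for p in procs:
        p.start()
    for req in requests:                          # workers pull as they finish
        req_q.put(req)
    for _ in procs:
        req_q.put(None)
    results = [res_q.get() for _ in requests]
    for p in procs:
        p.join()
    return dict(results)
```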