
[RFC]: Encoder/decoder models & feature compatibility #7366

Open afeldman-nm opened 1 month ago

afeldman-nm commented 1 month ago

Motivation #

There is significant interest in vLLM supporting encoder/decoder models. Issues #187 and #180, for example, request encoder/decoder model support. As a result, encoder/decoder support was recently introduced to vLLM via the following three PRs:

These three PRs make encoder/decoder model inference possible; however, they leave room for improvement with respect to (1) parity between vLLM's decoder-only and encoder/decoder request-processing pipelines in terms of feature support, and (2) the number of encoder/decoder models which are supported.

The ask for the vLLM community is to contribute PRs which help bring vLLM encoder/decoder functionality to a similar level of maturity as that of vLLM's decoder-only functionality.

Proposed changes #

The support matrix below summarizes which encoder/decoder models have already been added & which features are currently compatible with the vLLM encoder/decoder pipeline, versus which features & models will require additional PRs to implement in the long-term:

| Model/feature | Already available / compatible with encoder/decoder? | Long-term goal? |
|---|---|---|
| Encoder/decoder infrastructure | Yes | Yes |
| BART | Yes | Yes |
| Whisper | No | Yes |
| T5 | No | Yes |
| Other enc/dec models | No | Yes |
| Quantization | Untested | Yes |
| Multimodality | No | Yes |
| Attention backends other than XFormers (esp. flash-attn, flashinfer) | No | Yes |
| Custom attention bias support | No | Yes |
| CUDAGraph | No (Issue #7447) | Yes |
| Pipeline parallelism | No | Yes |
| Speculative decoding | No | Low-priority but nice-to-have; difficult |
| Automatic prefix caching | No | Low-priority; difficult |
| Sliding window | No | No |
| Chunked prefill | No | No |
| LoRA | No | No |

This RFC gives an overview of those features & models which are not currently compatible with encoder/decoder, but which should be made compatible eventually (i.e. No in the second column and Yes in the third column of the support matrix).

Note that there are features (automatic prefix caching/sliding window/chunked prefill/LoRA) which are not long-term compatibility goals.

Background #

Before continuing, it will be helpful to review the details of the new vLLM encoder/decoder infrastructure.

It will also be helpful to review this how-to guide for adding new encoder/decoder models & improving encoder/decoder feature compatibility.

Initial goal #

Members of the vLLM contributor community identify models/features in the support matrix above which they would like to work on, and contribute the corresponding PRs.

Detailed long-term goals #

Add new models to vLLM #

Please review the how-to guide for adding new models to vLLM

See tests/models/test_bart.py for an example of an encoder/decoder model unit test. See tests/distributed/test_basic_distributed_correctness_enc_dec.py for an example of an encoder/decoder model test with TP > 1.
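
For orientation, below is a stripped-down sketch of what such a correctness test can look like, written against the public transformers/vLLM APIs rather than the fixtures used in the test files above; the model name, prompt, and the assumption that greedy outputs match exactly are all illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "facebook/bart-large-cnn"   # assumption: any already-supported enc/dec checkpoint
PROMPT = ("vLLM is a high-throughput and memory-efficient inference "
          "and serving engine for LLMs. ") * 4
MAX_TOKENS = 32

def test_bart_greedy_matches_hf():
    # HuggingFace reference: greedy decoding (num_beams=1, no sampling).
    tok = AutoTokenizer.from_pretrained(MODEL)
    hf_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
    hf_ids = hf_model.generate(**tok(PROMPT, return_tensors="pt"),
                               max_new_tokens=MAX_TOKENS, num_beams=1, do_sample=False)
    hf_text = tok.decode(hf_ids[0], skip_special_tokens=True)

    # vLLM: temperature=0.0 selects greedy decoding; a bare string prompt is
    # treated as the encoder prompt for an encoder/decoder model.
    llm = LLM(model=MODEL, dtype="float32", enforce_eager=True)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_TOKENS)
    vllm_text = llm.generate([PROMPT], params)[0].outputs[0].text

    # Assumption: with identical greedy decoding the generated texts agree
    # (the real tests may compare token IDs and handle special tokens more carefully).
    assert vllm_text.strip() == hf_text.strip()
```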

Add Whisper model #

Steps to add support for Whisper, a multimodal encoder/decoder speech recognition model:

Proposal: consider whether or not it makes sense to implement encoder/decoder multimodality, audio support, and Whisper in the same PR; that way, the Whisper model may be used to facilitate an end-to-end test of audio multimodality.

Add T5 model #

Note: T5 depends on custom attention bias being supported by at least one attention backend which also supports encoder attention & cross-attention; at time of writing, no vLLM attention backend fulfills this requirement. The XFormers backend is the only backend which supports encoder/decoder models, and neither it nor any other vLLM attention backend supports custom attention bias. (Custom attention bias is required in order to support T5 relative positional encoding.)

Steps to add support for the T5 model:

Note: T5 was added to an older version of vLLM in #3117, which could be a helpful starting point.

Add other encoder/decoder models #

Quantization #

The goal of this workstream is to make sure that quantization with encoder/decoder models is fully tested, and to fill in any gaps (should they exist) in vLLM's support for quantized encoder/decoder models.

Steps to ensure that vLLM supports encoder/decoder models in combination with all existing vLLM quantization methods:

vLLM encoder/decoder infrastructure should be compatible with most of the existing vLLM quantization methods, because the specialized quantization kernels are only employed for GEMM operations involving the learned weight matrices ($W_q$, $W_k$, etc.), whereas the encoder/decoder work really only modifies how the Attention(q, k, v, kv_cache) layer behaves & does not impact the learned weight matrices at all.

It is less clear whether vLLM encoder/decoder infrastructure would be compatible with FP8: a specialized quantized KV-cache kernel does appear to be employed by the Attention(q, k, v, kv_cache) layer when FP8 quantization is enabled, so unlike weight-only methods this path does touch the attention layer and therefore needs testing.
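
As a first step toward "fully tested", a smoke test could simply load a quantized encoder/decoder checkpoint and confirm that generation runs end-to-end. A minimal sketch is shown below; the model name and quantization method are illustrative assumptions, not a configuration that is known to work today.

```python
from vllm import LLM, SamplingParams

# Smoke-test sketch: enable quantization for an encoder/decoder model and check
# that generation completes. Whether quantization="fp8" (or any other method)
# actually works with BART today is exactly what this workstream needs to establish.
llm = LLM(model="facebook/bart-large-cnn", quantization="fp8", enforce_eager=True)

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["vLLM is a high-throughput inference and serving engine for LLMs."],
                       params)
print(outputs[0].outputs[0].text)
```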

Support encoder/decoder multimodality #

Technically, vLLM already supports multimodality for models which have an "encoder" and a "decoder", e.g. LLaVA. However, LLaVA's decoder does not utilize cross-attention & the model is essentially compatible with vLLM's pre-existing decoder-only infrastructure.

Critically, however, for encoder/decoder models with cross-attention such as Whisper, vLLM does not currently support multimodality of any sort. The processing pipeline does not extract or utilize multimodal data from the input prompt, and the EncoderDecoderModelRunner has an assert which fails if the multimodal config is not None. Addressing this is what is meant by "supporting encoder/decoder multimodality".

Steps to extend existing vLLM multimodality support to encoder/decoder models:

There are a number of multimodal encoder/decoder models which will benefit from this feature. One possibility is to add multimodality support & a multimodal model such as Whisper in the same PR, so that Whisper may be used to facilitate an end-to-end test of multimodality.

Another possibility is to implement multimodality support in its own PR.

Considerations for designing multimodal encoder/decoder prompt formats #

One approach to designing the vLLM multimodal encoder/decoder prompt formats is to consider what we want the user experience to be for high-priority multimodal encoder/decoder models such as

Initial proposal for multimodal encoder/decoder prompt formats

It may be helpful to review

Generally speaking, in encoder/decoder models based on cross-attention, the non-text input modality is passed to the encoder as input, while any text prompt is typically passed to the decoder as an input prompt.
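
Purely to illustrate this encoder-input / decoder-prompt split (and not as one of the formats proposed below), a hypothetical dict-based prompt for an audio model might look like the sketch that follows; the "audio" key and waveform format are invented placeholders.

```python
import numpy as np

# Dummy 1-second, 16 kHz clip standing in for real audio input.
sampling_rate = 16000
waveform = np.zeros(sampling_rate, dtype=np.float32)

# Hypothetical explicit encoder/decoder prompt: the audio rides with the encoder
# prompt, while the text prompt is routed to the decoder.
prompt = {
    "encoder_prompt": {
        "prompt": "",  # no encoder text; the audio clip is the encoder input
        "multi_modal_data": {"audio": (waveform, sampling_rate)},  # invented key/format
    },
    "decoder_prompt": "<|startoftranscript|>",  # text prompt fed to the decoder
}
```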

The following two encoder/decoder multimodal prompt formats are tentatively proposed:

It may also be worth considering whether or how to support

Add support for encoder attention and cross-attention to additional backends #

At time of writing, XFormers is the only vLLM attention backend which supports encoder attention & cross-attention.

The goal of this workstream would be to extend encoder attention & cross-attention support to additional backends, the highest-priority being flash-attention and flashinfer.

Reviewing encoder attention and cross-attention support in the XFormers backend would be a good starting point for extending support to other backends.

For context on the requirements for a backend to support encoder and cross-attention, it may help to review the encoder/decoder architecture, the way that attention masks are currently constructed in the XFormers backend, and the recommended architecture for vLLM encoder/decoder models.

A summary of the key changes required for an attention backend to support encoder attention and cross-attention:

Initial goals

Long-term goals

Support custom attention bias #

Note: T5 depends on custom attention bias. Custom attention bias is likely complex enough to merit its own PR.

Note: custom bias support was added to PagedAttention in an older version of vLLM as part of #3117 ; given changes in vLLM since then, additional work would be required to integrate this implementation.

Custom attention bias and relative positional encoding

Attention bias refers to adding a matrix $A$ to the scaled dot-product (SDP) attention scores matrix before performing softmax, as shown below:

$$ attn(Q,K,V,A) = softmax(\frac{Q K^T + A}{\sqrt{d}})V $$

Here, custom attention bias is understood to mean that the vLLM attention backend allows $A$ to be an arbitrary matrix, provided the tensor dimensions are commensurate with the shape of the SDP attention scores matrix. This is in contrast to the existing vLLM attention backend implementations, which can only accommodate simple block-diagonal causal or non-causal masks which are uniformly either $0$ or $-\infty$.
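
For concreteness, a minimal PyTorch reference implementation of the formula above (independent of any vLLM backend or kernel) is sketched below.

```python
import math
import torch

def sdp_attention_with_bias(q, k, v, bias):
    """Reference attention matching the formula above: softmax((QK^T + A)/sqrt(d)) V.

    q, k, v: (num_heads, seq_len, head_dim); bias: broadcastable to the
    (num_heads, seq_len, seq_len) attention-scores shape.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1) + bias) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ v

# Any bias is accepted, not just the 0 / -inf masks current backends support.
h, n, d = 8, 16, 64
q, k, v = (torch.randn(h, n, d) for _ in range(3))
bias = torch.randn(h, n, n)
out = sdp_attention_with_bias(q, k, v, bias)
assert out.shape == (h, n, d)
```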

There are broadly two possible approaches to custom attention bias, which do not necessarily have to be mutually exclusive:

T5 employs custom attention bias in order to implement relative positional encoding, wherein pairwise positional relationships between tokens are represented by the bias matrix. The HuggingFace Transformers T5 implementation provides an example of how the relative positional encoding matrix is computed.
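
As a rough illustration of where such a bias matrix comes from, the sketch below builds a relative-position bias from a learned embedding; it is a deliberately simplified stand-in for the HF T5 logic (no log-spaced distance buckets), not a faithful reimplementation.

```python
import torch
import torch.nn as nn

class SimpleRelativeBias(nn.Module):
    """One learned scalar per (head, clamped relative offset).

    The real T5 implementation maps offsets into log-spaced buckets; here offsets
    are simply clamped to [-max_distance, max_distance] to keep the sketch short.
    """

    def __init__(self, num_heads: int, max_distance: int = 16):
        super().__init__()
        self.max_distance = max_distance
        self.embedding = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        ctx = torch.arange(q_len)[:, None]   # query positions
        mem = torch.arange(k_len)[None, :]   # key positions
        rel = (mem - ctx).clamp(-self.max_distance, self.max_distance) + self.max_distance
        # (q_len, k_len, num_heads) -> (num_heads, q_len, k_len): one bias per head.
        return self.embedding(rel).permute(2, 0, 1)

# Feeds straight into attn(Q, K, V, A) from the formula above.
bias = SimpleRelativeBias(num_heads=8)(q_len=16, k_len=16)
```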

Existing attention bias support

Currently, no vLLM attention backend fully supports passing in a custom attention bias, primarily due to underlying kernel limitations. For example, the xFormers memory_efficient_attention_forward kernel is the only NVIDIA-GPU-oriented kernel which permits passing an arbitrary PyTorch tensor as a materialized attention bias (via the attn_bias argument); at time of writing I have not investigated whether custom attention bias is supported by any of the kernels for AMD GPU, CPU, etc. Regardless, vLLM only employs xFormers memory_efficient_attention_forward for prefill; to my knowledge, none of the decode-phase kernels employed by vLLM can accept an arbitrary tensor as a custom attention bias, making custom attention bias impossible to apply end-to-end (for both prefill and decode) under the current vLLM implementation.

In addition to the lack of kernel-level support for custom attention bias, most vLLM backends also prevent passing a custom attention bias matrix to the underlying kernel. The exception is the XFormers backend, which accepts an attention bias via the XFormersMetadata.attn_bias attribute (however, the XFormers backend only utilizes attn_bias in the prefill phase).
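
To make the prefill-only capability above concrete, a minimal GPU sketch of calling the xFormers kernel with a materialized bias tensor follows; the exact shape/alignment constraints on the bias tensor are an assumption to be verified against the xFormers documentation.

```python
import torch
import xformers.ops as xops

# Shapes follow xFormers' (batch, seq_len, num_heads, head_dim) convention.
B, M, H, K = 1, 128, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# An arbitrary additive bias with one entry per (head, query, key) position;
# the kernel adds this to the attention scores before softmax.
bias = torch.randn(B, H, M, M, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention_forward(q, k, v, attn_bias=bias)
print(out.shape)  # (B, M, H, K)
```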

Proposed methods for supporting custom attention bias

Here the following two approaches for supporting custom attention bias in vLLM are proposed:

[Diagram illustrating the two proposed approaches for supporting custom attention bias]

It may make sense to support one or both of these methods.

Note that custom attention bias support must be added on a backend-by-backend basis, because of the kernel modifications & backend logic changes required.

Initial goals for introducing custom attention bias support

  1. Focus on a particular vLLM attention backend
  2. Scope out the effort involved in introducing custom attention bias support to this backend
  3. Some steps which will likely be involved in introducing custom attention bias support:
    • Augment the attention backend's kernels to accept a custom attention bias; for example, the PagedAttention kernel (for the XFormers backend), the flash-attention kernel (for the flash-attn backend), or the flashinfer kernels (for the flashinfer backend)
    • (Except for XFormers) add an attn_bias attribute to the attention backend's AttentionMetadata subclass (a sketch follows this list)
    • Ensure that the attention backend passes the attn_bias attribute to both the prefill and decode kernels
  4. Add at least two custom attention bias unit tests (for prefill & decode respectively)
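
A hypothetical sketch of step 3's second bullet is shown below; MyBackendMetadata and its fields are invented placeholders that only mirror the general shape of a vLLM AttentionMetadata subclass.

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class MyBackendMetadata:  # in vLLM this would subclass the backend's AttentionMetadata
    num_prefill_tokens: int
    num_decode_tokens: int
    # Arbitrary additive bias, broadcastable to the attention-scores shape.
    # None preserves today's behavior (plain causal / non-causal masking); when
    # set, the backend must thread it through to BOTH the prefill and decode
    # kernels -- the decode path is the part no backend implements today.
    attn_bias: Optional[torch.Tensor] = None

# Example: a bias covering 8 heads and a 16-token sequence.
meta = MyBackendMetadata(num_prefill_tokens=16, num_decode_tokens=0,
                         attn_bias=torch.zeros(8, 16, 16))
```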

Final goals for introducing custom attention bias support

Some links which may be helpful for understanding how causal & non-causal attention masks are currently configured in vLLM:

Support CUDAGraph with encoder/decoder models #

Note: this topic is being tracked by Issue #7447

Steps to support CUDAGraph with encoder/decoder models:

Support pipeline-parallelism with encoder/decoder models #

Steps to support pipeline-parallelism with encoder/decoder models:

Support multi-step scheduling with encoder/decoder models #

Note: depends on #7000 landing in order to add multi-step scheduling support; it may be helpful to review the multi-step scheduling RFC ( #6854 )

Steps to support multi-step scheduling with encoder/decoder models:

Low-priority high-effort tasks #

Here it is proposed that these features are low-priority. Adding support for speculative decoding and automatic prefix caching would require a significant amount of effort to scope out and design the implementations.

Note that adding support for either of these features would require removing the assertions which fail when speculative decoding or automatic prefix caching is enabled for an encoder/decoder model (see assert_enc_dec_mr_supported_scenario() in vllm/worker/utils.py).

Feedback Period.

Closed.

CC List.

@WoosukKwon @robertgshaw2-neuralmagic @mgoin @tms @njhill @sroy745 @ywang96 @DarkLight1337 @js8544

Any Other Things.

No response

afeldman-nm commented 1 month ago

Looks like footnote references (i.e. [^1], [^2], etc.) are not rendered for RFCs on github, so I just edited the RFC to replace all of the footnote references with direct links.

robertgshaw2-neuralmagic commented 4 weeks ago

cc @mgoin

afeldman-nm commented 3 weeks ago

FYI, added a section to the RFC about adding multi-step scheduling + encoder/decoder support.

@robertgshaw2-neuralmagic