Open shermansiu opened 1 year ago
Yes, this is in our plan. Adding these models requires modifying vLLM's cache block manager to also manage the attention cache of the encoder, which is a notable modification. Feel free to talk to us if you are interested to contribute and accelerate this process.
So... to contribute, we would need to re-implement the model in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/ (replacing self.attn with a paged version and using a KVCache during computation)?
It also seems like most linear projections are replaced with either ColumnParallelLinear or RowParallelLinear, right? So nn.Linear(small, big) is replaced with ColumnParallelLinear(small, big) (thus parallelizing the large number of columns), and nn.Linear(big, small) is replaced by RowParallelLinear(big, small)?
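The idea behind the two layer types can be illustrated with a small numpy sketch (illustrative only, not vLLM's actual ColumnParallelLinear/RowParallelLinear implementation): splitting a weight's output columns across workers lets each worker produce a slice of the output that is simply concatenated, while splitting a weight's input rows requires summing (all-reducing) the partial outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))      # batch of activations
w_up = rng.standard_normal((4, 8))   # nn.Linear(small, big) weight
w_down = rng.standard_normal((8, 4)) # nn.Linear(big, small) weight

# Column parallelism: each "worker" holds a slice of the output columns;
# the full result is the concatenation of the partial outputs.
col_shards = np.split(w_up, 2, axis=1)  # two workers
y_col = np.concatenate([x @ s for s in col_shards], axis=1)
assert np.allclose(y_col, x @ w_up)

# Row parallelism: each worker holds a slice of the input rows (and the
# matching slice of the activation's features); the full result is the
# sum (all-reduce) of the partial outputs.
h = x @ w_up  # the "big" intermediate activation
row_shards = np.split(w_down, 2, axis=0)
h_shards = np.split(h, 2, axis=1)
y_row = sum(hs @ ws for hs, ws in zip(h_shards, row_shards))
assert np.allclose(y_row, h @ w_down)
```

This is why the two are paired: the column-parallel up-projection leaves its output already sharded in exactly the layout the row-parallel down-projection consumes, so no communication is needed between them.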
Are https://github.com/vllm-project/vllm/pull/60/files and https://github.com/vllm-project/vllm/pull/50/files good reference PRs for this?
I see you've already answered this in the FAQ here: https://vllm.readthedocs.io/en/latest/models/adding_model.html
@zhuohan123 Hi, I'm interested in implementing support for encode-decoder models. Does it require any changes other than what's listed in https://vllm.readthedocs.io/en/latest/models/adding_model.html?
@WoosukKwon @zhuohan123 Hi, my team plans to work on T5 support. We would like to ask a few questions before we start.
- Apart from managing the cross-attention KV cache in block_manager.py and implementing the model in t5.py, are there any other components that need to change? Could you briefly describe how to implement this with minimal changes? Any help is appreciated. Thanks in advance!
@WoosukKwon @zhuohan123 Hi, my team plans to work on T5 support. We would like to ask a few questions before we start.
- Is the vLLM team currently working or planning to work on this? If so then there's no point for us to do it.
We are not actively working on this. Please go ahead!
- @zhuohan123 said above that it requires the cache block manager to also manage the attention cache of the encoder. However, AFAIU the encoder doesn't need KV caches. Instead it should manage the decoder's cross-attention KV cache, right?
Yeah, I think the point is to maintain the cross-attention KV cache generated by the encoder. I believe this cache should also be included in our block manager and managed in a blocked fashion, because its size depends on the input size, which can be highly variable.
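The variable-size point can be made concrete with a toy sketch (hypothetical names, not vLLM's actual BlockManager): the cross-attention KV cache is allocated in fixed-size blocks in proportion to the prompt length, and, unlike the decoder's self-attention cache, it never grows during generation:

```python
import math

# Toy sketch of blocked allocation for the cross-attention KV cache
# (hypothetical helper names, not vLLM's actual block manager API).
BLOCK_SIZE = 16  # tokens per KV cache block

def blocks_needed(num_tokens: int) -> int:
    # The cache is sized by the *encoder input* length.
    return math.ceil(num_tokens / BLOCK_SIZE)

def allocate_cross_attn_blocks(free_blocks: list, prompt_len: int) -> list:
    n = blocks_needed(prompt_len)
    if n > len(free_blocks):
        raise RuntimeError("out of KV cache blocks")
    # Unlike decoder self-attention, this allocation never grows during
    # generation: the encoder output length is fixed after the prompt run.
    return [free_blocks.pop() for _ in range(n)]
```

A prompt of 33 tokens with 16-token blocks would pin 3 blocks for the whole lifetime of the request.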
- Apart from managing the cross-attention KV cache in block_manager.py and implementing the model in t5.py, are there any other components that need to change? Could you briefly describe how to implement this with minimal changes?
Some points I can think of:
- profile_num_available_blocks(): this function profiles the maximum memory usage of the model, which may need to be changed because of the encoder-decoder structure.
- In t5.py, you might need to look at the input and check whether the input is a prompt run or a generation run. If it's a prompt run, you call the encoder, feed <sos> to the decoder, and run the first decoder step. If it's a generation run, you only call the decoder.
- I believe there can be some other places in our code where we assume the model is decoder-only.
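The prompt-run vs. generation-run branching described above can be sketched with a toy model (all names and the dummy encoder/decoder arithmetic are illustrative, not vLLM's actual model interface):

```python
class ToyEncDec:
    """Toy sketch of the prompt-run / generation-run split.

    The cached encoder output stands in for the cross-attention KV
    cache that the real implementation would keep in paged blocks.
    """

    def __init__(self):
        self.encoder_out = None  # "cross-attention KV cache"

    def encoder(self, tokens):
        return [t * 2 for t in tokens]  # dummy "encoding"

    def decoder(self, token, encoder_out):
        return token + sum(encoder_out)  # dummy next-token score

    def forward(self, tokens, is_prompt):
        if is_prompt:
            # Prompt run: run the encoder once, cache its output for
            # cross-attention, then do the first decoder step on <sos>.
            self.encoder_out = self.encoder(tokens)
            sos = 0
            return self.decoder(sos, self.encoder_out)
        # Generation run: encoder output is already cached; only the
        # decoder runs, attending to the cached cross-attention KVs.
        return self.decoder(tokens[-1], self.encoder_out)

m = ToyEncDec()
first = m.forward([1, 2, 3], is_prompt=True)    # encoder runs exactly once
nxt = m.forward([1, 2, 3, 5], is_prompt=False)  # decoder-only step
```

The key property the sketch captures is that the encoder runs exactly once per request, during the prompt run, and every subsequent step reuses its cached output.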
Thanks for taking this and please let us know if there's any issue! We are also happy to chat online if you need more detailed suggestions. Feel free to shoot me an email at zhuohan[at]berkeley.edu.
Also, I suppose the encoder cache eviction would be different: the encoder's cross-attention values would need to be kept as long as decoding is active for a prompt, but can be evicted the moment generation is completed.
(Never mind; for the sake of simplicity, LRU should work just fine.)
cc @rib-2
Update: I'm very close to finishing this. I've run T5 with vllm successfully on my local machine. I think I will be able to submit a PR in the coming weeks.
@js8544 Hello, is there any progress on this now? I would like to use it. Thank you
would this include BART?
Hello @js8544 thank you so much for this work. My team is very interested in encoder/decoder.
I would like to offer to help with landing this PR. How can I assist?
Once the encoder/decoder feature is landed, our team plans to integrate Whisper (audio speech recognition) support on top of it. This motivates the interest in supporting encoder/decoder work. @zhuohan123 FYI this relates to
I just submitted a draft PR: https://github.com/vllm-project/vllm/pull/3117. There are still some problems to solve. I would really appreciate any comments or advice.
I tried the pull request, T5 worked but BART did not.
@Elsayed91 did you write your own BART implementation? What was the nature of the issue?
Status update on encoder/decoder models & T5:
It has become clear that the aforementioned work rightfully belongs in at least two medium-small sized PRs, rather than a single large PR:
PR 1: vLLM infrastructure to support encoder/decoder, along with unit tests
PR 2: Support for T5 (draft PR: TBD)
My experience working on T5 integration suggests to me that T5's relative positional encoding relies on "custom attention bias" which is (1) not supported by vLLM flash_attn, (2) difficult to integrate efficiently into the existing vLLM workflow, and (3) really an entirely different task from encoder/decoder. Thus T5 support belongs in its own PR.
More on the impact which custom bias has on the outcome of working with models like T5 can be found in the comments on this post https://twitter.com/birchlabs/status/1782791645961859142?s=46
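For context, the "custom attention bias" in question can be sketched in plain numpy (a reference implementation of the concept, not vLLM or flash-attention code): T5's relative positional encoding adds a learned bias term, a function of the relative position j - i, to the attention scores before the softmax, which fused kernels that only compute softmax(qk^T/sqrt(d))v cannot express without explicit bias support:

```python
import numpy as np

def attn_with_bias(q, k, v, bias):
    # Standard scaled dot-product attention plus an additive bias on the
    # pre-softmax scores. For T5, bias[i, j] is a learned (bucketed)
    # function of the relative position j - i.
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
# Toy relative-position bias: depends only on (j - i), here a simple
# distance penalty rather than T5's learned buckets.
rel = np.arange(5)[None, :] - np.arange(5)[:, None]
bias = -0.1 * np.abs(rel)
out = attn_with_bias(q, k, v, bias)
```

Because the bias must be injected between the q·kᵀ product and the softmax, supporting it means either a kernel that accepts a bias tensor or materializing the full score matrix, which is exactly what fused flash-attention kernels avoid.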
Note that Whisper support (https://github.com/vllm-project/vllm/issues/180) takes a dependency on encoder/decoder as well, and will also be in a separate PR.
I totally agree. The relative attention bias of T5 was very painful to implement, and it is not necessary for other enc-dec models like Whisper. I can add T5 support after your enc-dec infra PR is merged.
BTW, BART would be simpler than T5 because it uses the original Transformer structure. Maybe we can do BART first.
Quick update: the PR to support cross-attention caching (https://github.com/vllm-project/vllm/pull/4837) has been landed. Now I am working on landing the PR to correctly invoke the attention kernel for cross-attention (https://github.com/vllm-project/vllm/pull/4888).
Update:
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) #4942

Is there any documentation for inference with BART-type models? Thanks.
Hello @anonymousz97, this PR https://github.com/vllm-project/vllm/pull/4942 will include BART support & example code for invoking BART. This PR is WIP but should be ready for review soon.
Hi @afeldman-nm! Is this PR also going to support https://huggingface.co/facebook/bart-large-mnli? Thank you.
Thanks, I will try it @afeldman-nm.
Update: #4888 is landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have been landed) were prerequisites for #4942 . #4942 completes end-to-end support for encoder/decoder models & also introduces the BART model into vLLM. #4942 is still WIP.
Does it support the MBartForConditionalGeneration model, @afeldman-nm? Thanks.
@afeldman-nm Will that PR include T5 support?
FYI encoder/decoder support has landed (#4942); there is an example in examples/offline_inference_encoder_decoder.py. BART has been integrated into vLLM (T5 and Whisper have not, to answer a previous question).
Currently vLLM encoder/decoder support is constrained in which features it is compatible with (e.g. not CUDAGraph, not pipeline parallelism, ...), so it is now a goal to make more features compatible with vLLM's encoder/decoder processing pipeline.
To that end, RFC #7366 overviews the vLLM features which are currently not compatible with encoder/decoder, with an eye toward bringing vLLM's encoder/decoder support to parity with vLLM's decoder-only support.
Additionally #7366 proposes adding custom attention bias support as well as the Whisper and T5 models.
The RFC feedback period is 1 week (until August 16th).
Will there be added support for encoder-decoder models, like T5 or BART? All of the currently supported models are decoder-only.