huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

T5 encoder past_key_values #15591

Closed ZhaofengWu closed 2 years ago

ZhaofengWu commented 2 years ago

Currently only the T5 decoder can use past_key_values, but not the encoder. Why? https://github.com/huggingface/transformers/blob/644ec052332beb160276ceaf9cc9ab6bbf0b89d2/src/transformers/models/t5/modeling_t5.py#L631

My use case is Li & Liang (2021) style prefix tuning (https://arxiv.org/abs/2101.00190), which I think can be implemented by providing past_key_values to the encoder.
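Concretely, the call I would like to be able to make is roughly the following (just a sketch: the prefix here is random rather than learned, `prefix_len` is an arbitrary placeholder, and the final `model.encoder(...)` call is exactly what the linked assertion currently rejects):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
cfg = model.config

batch_size, prefix_len = 2, 10  # prefix_len = number of "virtual tokens" per layer

# One (key, value) pair per encoder layer, in the same per-head layout the decoder's
# self-attention cache uses: (batch, num_heads, prefix_len, d_kv). In prefix tuning
# these would be trainable parameters; random tensors stand in for them here.
prefix = tuple(
    (
        torch.randn(batch_size, cfg.num_heads, prefix_len, cfg.d_kv),
        torch.randn(batch_size, cfg.num_heads, prefix_len, cfg.d_kv),
    )
    for _ in range(cfg.num_layers)
)

inputs = tokenizer(
    ["translate English to German: hello", "summarize: a longer input"],
    padding=True,
    return_tensors="pt",
)

# The attention mask needs prefix_len extra (all-ones) positions for the prefix.
prefix_mask = torch.ones(batch_size, prefix_len)
attention_mask = torch.cat([prefix_mask, inputs.attention_mask], dim=1)

# This is the call that the assertion in T5Stack currently forbids:
encoder_outputs = model.encoder(
    input_ids=inputs.input_ids,
    attention_mask=attention_mask,
    past_key_values=prefix,
)
```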

And if past_key_values is not a viable option, are there other simple ways to implement this with T5?

dblakely commented 2 years ago

The reason is that the encoder has no need to cache keys and values: all we need to do is cache the encoder's output hidden states, which can then be reused for cross-attention. Ultimately, the T5 encoder is only run once during generation, a single forward pass at the very beginning, after which we have the hidden states we need. The decoder, on the other hand, does benefit from caching keys and values: unlike the encoder, it runs a forward pass at every generation step and would otherwise have to recompute them each time.
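To make that concrete, here is a minimal greedy-decoding loop written out by hand (just a sketch of what `generate()` does internally): the encoder runs exactly once, and every decoder step reuses the encoder outputs plus the cached decoder keys/values via `past_key_values`.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
tokenizer = T5Tokenizer.from_pretrained("t5-small")

enc = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")

with torch.no_grad():
    # Encoder: exactly one forward pass; only its hidden states are kept.
    encoder_outputs = model.get_encoder()(
        input_ids=enc.input_ids, attention_mask=enc.attention_mask
    )

    decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    past_key_values = None
    generated = []

    for _ in range(20):
        out = model(
            encoder_outputs=encoder_outputs,      # reused at every step
            attention_mask=enc.attention_mask,    # encoder mask, for cross-attention
            decoder_input_ids=decoder_input_ids,  # only the newest token once we have a cache
            past_key_values=past_key_values,      # cached decoder keys/values
            use_cache=True,
        )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_token.item() == model.config.eos_token_id:
            break
        generated.append(next_token.item())
        decoder_input_ids = next_token  # the cache holds everything before this token

print(tokenizer.decode(generated, skip_special_tokens=True))
```

In practice `generate()` handles all of this for you; the loop just makes the caching behavior explicit.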

> And if past_key_values is not a viable option, are there other simple ways to implement this with T5?

I'm not familiar with that paper, so I might not be of any help. However, you do have access to the encoder hidden states, which are essentially just a higher dimensional version of the key and value states, so you could look into ways to use those instead. Alternatively, you could modify the T5 code to output the key and value states computed in the encoder.

ZhaofengWu commented 2 years ago

Yes, I understand this motivation. But past_key_values can also be used to implement the method I cited. It is perhaps an unorthodox use of this argument, but I don't think it should be forbidden. Re: encoder hidden states: past_key_values effectively prepends several trainable tokens at each layer for T5 to attend to, and I don't think manipulating the hidden states can achieve the same effect. And yes, I could modify the T5 code, but that would be inelegant: since I would need to modify T5Attention, I would also have to modify (or subclass) every other T5 class that uses it, directly or indirectly.

dblakely commented 2 years ago

> Re: encoder hidden states: past_key_values effectively prepends several trainable tokens at each layer for T5 to attend to, and I don't think manipulating the hidden states can achieve the same effect.

Do you mean in Hugging Face's T5 implementation or in the paper you're trying to implement? The HF T5 implementation simply projects the previous layer's hidden states into key and value states, without doing anything special per layer. So it sounds to me like you might have to customize T5Attention for your use case anyway.

dblakely commented 2 years ago

Also, I think the maintainers missed this thread, so perhaps open a feature request for passing past keys and values to the encoder? I do think you're correct that they could be passed to the encoder (T5Attention would have to be modified to concatenate them to the present keys and values, as it already does in the decoder). Or you could tag one of the maintainers in this thread.
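For reference, the modification would be roughly the following (a simplified schematic of the key/value path in a T5-style attention layer, with my own helper names rather than the literal T5Attention code; handling of T5's relative position bias is omitted):

```python
import torch


def keys_and_values(hidden_states, k_proj, v_proj, n_heads, d_kv, past_key_value=None):
    """Schematic key/value path of a T5-style self-attention layer (not library code)."""
    batch = hidden_states.shape[0]

    def split_heads(x):
        # (batch, seq, n_heads * d_kv) -> (batch, n_heads, seq, d_kv)
        return x.view(batch, -1, n_heads, d_kv).transpose(1, 2)

    key_states = split_heads(k_proj(hidden_states))
    value_states = split_heads(v_proj(hidden_states))

    if past_key_value is not None:
        # This concatenation is what the decoder already does with its cache; the
        # same operation would let the encoder prepend learned prefix keys/values.
        key_states = torch.cat([past_key_value[0], key_states], dim=2)
        value_states = torch.cat([past_key_value[1], value_states], dim=2)

    return key_states, value_states
```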

ZhaofengWu commented 2 years ago

I was referring to the paper: its method could be implemented by passing past_key_values to the encoder, without changing T5Attention at all. This has been done for BERT-like models in exactly this way, by the authors and by others, but not for T5, because of this restriction.

Maybe @LysandreJik, since we briefly talked about this on Slack?

patrickvonplaten commented 2 years ago

Hey @ZhaofengWu,

Sorry, I thought I had answered here before. I mostly agree with what @dblakely says here. Could you maybe open a PR to showcase what changes we would have to make to satisfy your use case? IMO this is very much an edge case, and I've never seen anybody use past_key_values states in BERT. The only reason we have this implemented in BertAttention is that BERT can be used as a decoder.

ZhaofengWu commented 2 years ago

@patrickvonplaten Sorry for the late reply. I think the only change needed would be to remove this assertion. Re: past work using past_key_values in BERT to provide a prefix, see the original implementation of the prefix-tuning paper (https://github.com/XiangLi1999/PrefixTuning) and the follow-up work P-tuning-v2 (https://github.com/THUDM/P-tuning-v2).
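For reference, this is roughly how I understand those implementations pass a prefix to a BERT-like encoder (the names below are my own placeholders, not theirs; in the cited repos the prefix comes from a small trainable prefix-encoder module, but a plain `nn.Parameter` is enough to show the interface: per-layer key/value tensors plus an attention mask extended to cover the prefix):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
cfg = model.config

prefix_len = 16  # number of prefix positions (placeholder value)
head_dim = cfg.hidden_size // cfg.num_attention_heads

# A trainable prefix: one (key, value) pair per layer.
prefix_kv = nn.Parameter(
    torch.randn(cfg.num_hidden_layers, 2, cfg.num_attention_heads, prefix_len, head_dim)
)

inputs = tokenizer(["an example sentence"], return_tensors="pt")
batch_size = inputs.input_ids.shape[0]

# Expand to the batch and split into the tuple-per-layer format BERT expects:
# each element is (key, value), each of shape (batch, num_heads, prefix_len, head_dim).
past_key_values = tuple(
    (
        layer[0].unsqueeze(0).expand(batch_size, -1, -1, -1),
        layer[1].unsqueeze(0).expand(batch_size, -1, -1, -1),
    )
    for layer in prefix_kv
)

# The attention mask must also cover the prefix positions.
prefix_mask = torch.ones(batch_size, prefix_len)
attention_mask = torch.cat([prefix_mask, inputs.attention_mask], dim=1)

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
)
```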

ZhaofengWu commented 2 years ago

Oh, and another thing: currently, a past_key_values argument passed to a T5 model is only given to the decoder. For my purposes this can be worked around by manually computing encoder_outputs (with past_key_values) and giving that to the T5 model, but ideally there could be something like {encoder,decoder}_past_key_values. Though I would understand if you consider this too hacky and are unwilling to support it.
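In other words, the workaround looks roughly like this (a sketch that assumes the encoder-side assertion is already removed or patched; the prefix is again random just to make the shapes concrete):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
cfg = model.config

batch_size, prefix_len = 1, 10
encoder_prefix = tuple(
    (
        torch.randn(batch_size, cfg.num_heads, prefix_len, cfg.d_kv),
        torch.randn(batch_size, cfg.num_heads, prefix_len, cfg.d_kv),
    )
    for _ in range(cfg.num_layers)
)

enc = tokenizer(["summarize: some input text"], return_tensors="pt")
labels = tokenizer(["a target"], return_tensors="pt").input_ids

# Step 1: run the encoder manually, giving it the prefix (this is the call the
# current assertion blocks). The encoder's attention mask has to cover the prefix.
prefix_mask = torch.ones(batch_size, prefix_len)
encoder_outputs = model.encoder(
    input_ids=enc.input_ids,
    attention_mask=torch.cat([prefix_mask, enc.attention_mask], dim=1),
    past_key_values=encoder_prefix,
)

# Step 2: hand the precomputed encoder outputs to the full model. A past_key_values
# argument here would only reach the decoder. Note the cross-attention mask is the
# *original* one: the encoder output still has one hidden state per real input token,
# not per prefix position.
outputs = model(
    encoder_outputs=encoder_outputs,
    attention_mask=enc.attention_mask,
    labels=labels,
)
loss = outputs.loss
```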

patrickvonplaten commented 2 years ago

> @patrickvonplaten Sorry for the late reply. I think the only change needed would be to remove this assertion. Re: past work using past_key_values in BERT to provide a prefix, see the original implementation of the prefix-tuning paper (https://github.com/XiangLi1999/PrefixTuning) and the follow-up work P-tuning-v2 (https://github.com/THUDM/P-tuning-v2).

OK, if we just need to change the assertion to a warning, then I think that this would be fine :-) Would you like to open a PR for this?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yushuinanrong commented 2 years ago

> Oh, and another thing: currently, a past_key_values argument passed to a T5 model is only given to the decoder. For my purposes this can be worked around by manually computing encoder_outputs (with past_key_values) and giving that to the T5 model, but ideally there could be something like {encoder,decoder}_past_key_values. Though I would understand if you consider this too hacky and are unwilling to support it.

Hi Zhaofeng,

I'm also trying to apply the prefix-prompt idea to the T5 model and am facing issues similar to yours. Could you elaborate a bit more on how you passed past_key_values to the T5 encoder?

Thanks, JC

ZhaofengWu commented 2 years ago

I ended up writing a whole stack of T5ModelWithPrefix/T5EncoderWithPrefix/... classes. I had hoped it would be easier, but I don't think the current transformers setup allows for that.
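Roughly, the stack looks like this (just an outline; the class names are illustrative, the method bodies are omitted, and each subclass mostly exists so that its parent instantiates the -WithPrefix version of the layer below it):

```python
# Outline only: the actual overrides (prefix concatenation, attention-mask and
# relative-position-bias handling, passing the prefix down) are omitted here.
from transformers.models.t5.modeling_t5 import (
    T5Attention,
    T5Block,
    T5ForConditionalGeneration,
    T5LayerSelfAttention,
    T5Stack,
)


class T5AttentionWithPrefix(T5Attention):
    """Concatenate prefix keys/values to the projected keys/values."""


class T5LayerSelfAttentionWithPrefix(T5LayerSelfAttention):
    """Build its attention as T5AttentionWithPrefix."""


class T5BlockWithPrefix(T5Block):
    """Build its self-attention layer as T5LayerSelfAttentionWithPrefix."""


class T5EncoderWithPrefix(T5Stack):
    """Build blocks as T5BlockWithPrefix and accept a prefix in forward()."""


class T5ModelWithPrefix(T5ForConditionalGeneration):
    """Own the trainable prefix parameters and use T5EncoderWithPrefix."""
```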

yushuinanrong commented 2 years ago

> I ended up writing a whole stack of T5ModelWithPrefix/T5EncoderWithPrefix/... classes. I had hoped it would be easier, but I don't think the current transformers setup allows for that.

Hi Zhaofeng, would it be possible to share your implementation? I couldn't make it work based on the current transformers implementation of T5 either. Or could you shed more light on what core changes you made?

BAOOOOOM commented 2 years ago

> I ended up writing a whole stack of T5ModelWithPrefix/T5EncoderWithPrefix/... classes. I had hoped it would be easier, but I don't think the current transformers setup allows for that.

Hello, I have the same problem. Could you share your implementation? Thank you very much.

ZhaofengWu commented 2 years ago

Sure, I should be able to share it along with some other project-specific code after the EMNLP decisions, which will be in less than a month. I will let you know when that happens.

BAOOOOOM commented 2 years ago

Thank you very much.


cuthalionn commented 1 year ago

In case someone is still looking for a T5 prefix-tuning implementation: https://github.com/HKUNLP/UnifiedSKG/blob/main/models/prompt/modeling_t5.py

BAOOOOOM commented 1 year ago

> In case someone is still looking for a T5 prefix-tuning implementation: https://github.com/HKUNLP/UnifiedSKG/blob/main/models/prompt/modeling_t5.py

Thanks!

ZhaofengWu commented 1 year ago

@yushuinanrong @BAOOOOOM Sorry for the delay, but if it's still helpful, our implementation can be found at https://github.com/allenai/better-promptability/blob/main/better_promptability/models/t5_with_prefix.py