huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Community contribution - `BetterTransformer` integration for more models! #20372

Closed - younesbelkada closed this issue 11 months ago

younesbelkada commented 1 year ago

BetterTransformer integration for more models!

The BetterTransformer API provides faster inference on CPU & GPU through a simple interface!

Models can benefit from significant speedups with a one-liner, provided the latest version of PyTorch is installed. A complete guide on how to convert a new model is available in the BetterTransformer documentation!
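In practice, the one-liner looks like the sketch below (the checkpoint name is only a placeholder; any supported architecture works):

```python
# Minimal sketch of the one-liner conversion; "bert-base-uncased" is just an example.
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")
# Swaps the supported encoder layers for their fast, fused BetterTransformer version.
model = BetterTransformer.transform(model)
```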

Here is a list of models that could potentially be supported; pick one of the architectures below and let's discuss the conversion!

Text models 🖊️ :

Vision models 📷 :

Audio models 🔉 :

Let us also know if you think we missed architectures that could be supported. Note that for the encoder-decoder based models below, we expect to convert the encoder only.

Support for decoder-based models coming soon!

cc @michaelbenayoun @fxmarty

https://github.com/huggingface/optimum/issues/488

hamishdickson commented 1 year ago

NotImplementedError: The Better Transformers implementation for the model DebertaV2Model has not been implemented yet. Please open an issue requesting the addition of this model with its BetterTransformer implementation.

It's not on your list, but would you complain if I did this for DebertaV2Model?

michaelbenayoun commented 1 year ago

It is not in the list because DebertaV2 does not have a regular attention mechanism, so it is not possible to use it with BetterTransformer.

younesbelkada commented 1 year ago

Yes I second what @michaelbenayoun said, please see related: https://github.com/huggingface/optimum/issues/487

hamishdickson commented 1 year ago

makes a lot of sense - sorry I should have thought about that a bit harder before posting!

GenVr commented 1 year ago

I noticed that Better Transformers for the T5 model has not been implemented yet. Will it be implemented in the future (if possible)? Thanks.

younesbelkada commented 1 year ago

Hi @GenVr, thanks a lot for your reply! Unfortunately, T5 cannot be supported because of the nature of its attention mechanism. T5 uses an attention bias, which is not supported by BetterTransformer. Thanks!
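For the curious, here is a quick, hedged way to see the bias in question, using a tiny random-weight config purely for illustration:

```python
# Illustration only: T5's first self-attention layer owns a learned relative position
# bias that is added to the attention scores, which the fused kernel cannot take.
from transformers import T5Config, T5Model

config = T5Config(d_model=64, num_layers=1, num_heads=4, d_ff=128, vocab_size=100)
model = T5Model(config)

self_attention = model.encoder.block[0].layer[0].SelfAttention
print(self_attention.has_relative_attention_bias)  # True for the first block
print(self_attention.relative_attention_bias)      # nn.Embedding producing the bias
```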

RJZauner commented 1 year ago

Hi :) I would like to work on the implementation for RemBertLayer.

What are the next steps in getting started?

Thank you!

younesbelkada commented 1 year ago

Hey @RJZauner! Thanks so much for your interest in helping us integrate more models into BetterTransformer! RemBert seems to use the same attention mechanism as BERT; the only difference should be in the embedding layer, which is fine for us. So I would say you can go ahead: fork the optimum library, create a new branch, and open a draft PR. Feel free to take some inspiration from what has been done in https://github.com/huggingface/optimum/pull/494 and https://github.com/huggingface/optimum/pull/508 to see what exactly needs to be done ;) Ping us (myself, @michaelbenayoun & @fxmarty) whenever you feel that you need help!

shogohida commented 1 year ago

Hi @younesbelkada, I would like to work on the easiest of the models mentioned above. Which one do you recommend? This might sound a bit weird, but I want to tackle a simple one since I'm not very familiar with these models 🙏

JuheonChu commented 1 year ago

Hello, I would like to tackle the implementation for TapasLayer.

May I ask how I can get started on the next steps?

Thank you for your time.

michaelbenayoun commented 1 year ago

Hi @shogohida and @JuheonChu ,

You can read this page to learn how to contribute. You can then open a PR with your code and ask questions there; we will be glad to help!

Also @shogohida, I think they are all similar in terms of difficulty, so do not get blocked on that; maybe choose a model from the modality you are most familiar with.

younesbelkada commented 1 year ago

Seconding what @michaelbenayoun said, feel free to check some example PRs, https://github.com/huggingface/optimum/pull/508 or https://github.com/huggingface/optimum/pull/494, for reference! @shogohida, you can take RocBERT; it actually copies from BERT, so the conversion will be very easy :)
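To give an idea of what the conversion boils down to, here is a simplified, standalone sketch (not the actual optimum class; the class name below is made up for illustration): the separate query/key/value projections of a BERT-style layer are concatenated into the fused in-projection parameters that the fast kernel expects.

```python
# Simplified, standalone illustration of the conversion idea (NOT the actual optimum
# class; the class name is hypothetical).
import torch
import torch.nn as nn

class FusedEncoderLayerSketch(nn.Module):
    def __init__(self, bert_layer):
        super().__init__()
        attn = bert_layer.attention.self
        # Fuse the separate Q, K, V projections into single in-projection tensors,
        # the layout expected by the fused PyTorch encoder kernel.
        self.in_proj_weight = nn.Parameter(
            torch.cat([attn.query.weight, attn.key.weight, attn.value.weight])
        )
        self.in_proj_bias = nn.Parameter(
            torch.cat([attn.query.bias, attn.key.bias, attn.value.bias])
        )
        # The attention output projection, layer norms and feed-forward weights are
        # copied over in the same spirit (omitted here for brevity).
```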

shogohida commented 1 year ago

Thanks guys for your replies! I will take RocBERT then!

JuheonChu commented 1 year ago

Thanks @michaelbenayoun ! I will take TapasLayer !

ravenouse commented 1 year ago

Hi! Thank you so much for opening this issue.

  1. I was implementing RemBERT and had some questions, but then I noticed that @RJZauner was already working on it. I am going to hold off on my work and look forward to seeing RJZauner's implementation!
  2. I will work on mBART.
  3. I also found some dead links and some unclear points on this page. How should I report and help fix the problems I found?

blakechi commented 1 year ago

Hello @younesbelkada,

I would like to take DetrLayer. Nice tutorial btw 😀

younesbelkada commented 1 year ago

Hi @blakechi ! Sure you can take it ;) let me know if you need help opening a PR!

younesbelkada commented 1 year ago

Hi @ravenouse! Thanks for your help! Yes, you can take MBART ;) Regarding the dead links, could you open an issue on optimum? Thanks!

RJZauner commented 1 year ago

> Hey @RJZauner! Thanks so much for your interest in helping us integrate more models into BetterTransformer! RemBert seems to use the same attention mechanism as BERT; the only difference should be in the embedding layer, which is fine for us. So I would say you can go ahead: fork the optimum library, create a new branch, and open a draft PR. Feel free to take some inspiration from what has been done in huggingface/optimum#494 and huggingface/optimum#508 to see what exactly needs to be done ;) Ping us (myself, @michaelbenayoun & @fxmarty) whenever you feel that you need help!

Thank you for the info!

lucaspct commented 1 year ago

Hello @michaelbenayoun and @younesbelkada !

First time contributing for me :)

I would like to handle the implementation for Speech2Text

What are the first steps? Create a PR?

Thanks in advance.

JuheonChu commented 1 year ago

> Hello @michaelbenayoun and @younesbelkada !
>
> First time contributing for me :)
>
> I would like to handle the implementation for Speech2Text
>
> What are the first steps? Create a PR?
>
> Thanks in advance.

Hello, I am sure they will give you better suggestions than I can, but I'd like to share that it really helps to read CONTRIBUTING.md in the transformers repository. I read through it very carefully before making my first contribution!

lucaspct commented 1 year ago

> Hello, I am sure they will give you better suggestions than I can, but I'd like to share that it really helps to read CONTRIBUTING.md in the transformers repository. I read through it very carefully before making my first contribution!

Hello @JuheonChu :)

I will definitely have a look at it! Thanks.

michaelbenayoun commented 1 year ago

Hi @lucaspct,

Yes, the first step would be to read the guide explaining how to contribute to optimum.bettertransformer, and then open a PR on Optimum; we will support you there!

miyu386 commented 1 year ago

Hi @younesbelkada @michaelbenayoun I'd love to take on the RoFormer model if it isn't claimed yet. Will open a PR after I read through the guide!

adit299 commented 1 year ago

I would like to take a crack at the ProphetNet encoder if it has not been claimed yet

younesbelkada commented 1 year ago

Thank you very much @miyu386 & @adit299! Of course, you can give those a try ;) Feel free to open a PR on optimum and we'll guide you from there 💪

ravenouse commented 1 year ago

I would like to work on the ASTLayer if no one has taken it!

katiele47 commented 1 year ago

Hi @younesbelkada I'd like to tackle the FlavaLayer if it has not been taken!

younesbelkada commented 1 year ago

Hi @katiele47 Sure no problem! Feel free to open a PR and tag us there! I will update the table above once the PRs are open ;)

hazrulakmal commented 1 year ago

Hi, @younesbelkada I'd like to take GLPNLayer if no one has claimed it. will open the PR soon for this :)

younesbelkada commented 1 year ago

Hi @hazrulakmal ! Sure! Let me know once you open the PR ;)

stanleycai95 commented 1 year ago

Hi! I'd love to contribute wherever I can be useful.

ravenouse commented 1 year ago

Hi @younesbelkada! I found that the torch._transformer_encoder_layer_fwd() function is called to perform the forward pass in our BetterTransformerBaseLayer. To better understand what's going on under the hood, I searched for this function online but didn't find much information about it. Could you tell me where I can find its source code? Thank you so much!

M0315G commented 1 year ago

Hello, I'd like to work on RocBert layer. I'll go over the contributing guide and open a PR. Anything else I need to go through as a starting point?

hchings commented 1 year ago

Hi @younesbelkada, I added a PR for RemBERT. Since RemBERT's primary changes are in embeddings (decoupling in/output embeddings, increasing output embeddings during pre-training to improve downstream performance, etc), the needed changes should be straightforward. But please kindly let me know if I missed anything.

I also want to apologize to @RJZauner, I realized you've claimed RemBERT after re-reading this thread now. I'll be more careful next time!! And lmk if you want me to withdraw the PR to use your changes instead. If not, feel free to add on top of it and hopefully we can both learn together 🙏🙏.

younesbelkada commented 1 year ago

Hi @M0315G, thanks so much! Unfortunately this is already a work in progress by @shogohida here: https://github.com/huggingface/optimum/pull/542. Feel free to take another model that is still free ;) More models will also be added to the list as transformers continues to integrate new architectures.

M0315G commented 1 year ago

Should I take RoFormer then?

younesbelkada commented 1 year ago

Yes, I think you can take this one. From my understanding, RoFormer is an LM similar to BERT, but it uses rotary positional embeddings while keeping a classic attention mechanism, i.e. the same as BERT's. Ideally you can double-check that, but I believe BetterTransformer can be supported for RoFormer, so I would say you can take it, yes.

M0315G commented 1 year ago

Understood. Will go through the docs and open a PR in optimum. Any other things I should take care of?

younesbelkada commented 1 year ago

Just take some inspiration from other PRs and everything should go smoothly!

younesbelkada commented 1 year ago

Hi @ravenouse! From what I understand, this function is a C++ binding of the transformer encoder operation: it is first declared here and fully defined here. As you can see, the whole transformer encoder computation (self-attention + FFN) is implemented as a single fused operation.
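A quick, hedged way to see this fused path in action without reading the C++ is to profile PyTorch's own nn.TransformerEncoderLayer in inference mode, which dispatches to the same fused op when the fastpath conditions are met (the shapes below are arbitrary):

```python
# Sketch: profile PyTorch's built-in encoder layer in inference mode. When the
# fastpath conditions are met (eval mode, no grad, etc.), the whole layer runs as a
# single fused op instead of separate attention / linear / layernorm kernels.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()
x = torch.randn(2, 16, 64)  # (batch, sequence, hidden) - arbitrary toy shapes

with torch.no_grad(), torch.profiler.profile() as prof:
    layer(x)

# If the fastpath was taken, a fused encoder op should dominate the trace.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```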

younesbelkada commented 1 year ago

Hi @hchings ! Just reviewed your PR ;) will merge that soon!

hazrulakmal commented 1 year ago

Hi @blakechi, I saw that you wanted to work on DetrLayer a week ago. How is your progress? Just asking because if you are not working on it anymore, I'm more than happy to help with the conversion and take over the PR :)

younesbelkada commented 1 year ago

Hi @hazrulakmal, I realized that the DPT model (a depth-estimation model) can be supported too, as it uses ViT as a backbone. Would you like to give this one a try instead? 🙏 We are currently looking at speeding up this model, so it would be a nice addition.

miyu386 commented 1 year ago

Hi @younesbelkada, is there any model left that I can work on? It seems Detr was claimed a while ago, but I'm not sure whether a PR was opened for it.

hazrulakmal commented 1 year ago

@younesbelkada yup, definitely, I can take this up! I checked the encoder layer and it looks possible to integrate with BT. Should I name the new class DPTViTLayerBetterTransformer?

(encoder): DPTViTEncoder(
    (layer): ModuleList(
      (0): DPTViTLayer(
        (attention): DPTViTAttention(
          (attention): DPTViTSelfAttention(
            (query): Linear(in_features=32, out_features=32, bias=True)
            (key): Linear(in_features=32, out_features=32, bias=True)
            (value): Linear(in_features=32, out_features=32, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): DPTViTSelfOutput(
            (dense): Linear(in_features=32, out_features=32, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): DPTViTIntermediate(
          (dense): Linear(in_features=32, out_features=37, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): DPTViTOutput(
          (dense): Linear(in_features=37, out_features=32, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (layernorm_before): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
        (layernorm_after): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
      )
    )
)

younesbelkada commented 1 year ago

Very cool! Yes you can name it like this ;) Looking forward to seeing your PR 💪

younesbelkada commented 1 year ago

Hi @miyu386, ViT-Hybrid has just been integrated into transformers; would you like to take this one? https://github.com/huggingface/transformers/blob/d151a8c55032d5a21800ea0813c4304af8b8e9f7/src/transformers/models/vit_hybrid/modeling_vit_hybrid.py#L362

RJZauner commented 1 year ago

> Hi @younesbelkada, I added a PR for RemBERT. Since RemBERT's primary changes are in embeddings (decoupling in/output embeddings, increasing output embeddings during pre-training to improve downstream performance, etc), the needed changes should be straightforward. But please kindly let me know if I missed anything.
>
> I also want to apologize to @RJZauner, I realized you've claimed RemBERT after re-reading this thread now. I'll be more careful next time!! And lmk if you want me to withdraw the PR to use your changes instead. If not, feel free to add on top of it and hopefully we can both learn together 🙏🙏.

Hey :) don't sweat it - no need to withdraw your PR.

Your implementation looks great - thanks!

miyu386 commented 1 year ago

@younesbelkada Yes, I'd like to give it a try! Thanks