huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add FAVOR+ / Performer attention #7675

Open marrrcin opened 3 years ago

marrrcin commented 3 years ago

🌟 FAVOR+ / Performer attention addition

Are there any plans to add this new attention approximation block to Transformers library?

Model description

The new attention mechanism with linear time and space complexity was introduced in Rethinking Attention with Performers [https://arxiv.org/abs/2009.14794]. The authors claim that the new attention mechanism is backward-compatible with already existing models:

Backwards compatibility with pretrained models is available as a benefit from softmax approximation, via small finetuning (required due to error propagation)

Open source status

bratao commented 3 years ago

Just for reference, there are two open-source MIT-licensed implementations in PyTorch:

https://github.com/lucidrains/performer-pytorch and https://github.com/idiap/fast-transformers

simonjanin commented 3 years ago

This could prove particularly important for longer sequences like protein sequences and long texts. High level overview at https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html.

JaeDukSeo commented 3 years ago

if this could be implemented it would be dope!

norabelrose commented 3 years ago

It would be nice to make it possible to use FAVOR+ in combination with the pretrained models that use softmax attention, at least the popular ones like BERT. Or even better, someone could just do the fine-tuning for the common pretrained models and then we could make those available out of the box. I should be able to do that for DistilBERT since I plan to be using DistilBERT + FAVOR for a project soon.

norabelrose commented 3 years ago

Just started a fork to work on this at https://github.com/norabelrose/transformers-plus-performers. Is it okay with everyone if I implement it by creating a new file implementing FAVOR+ multihead attention (maybe one file for the PyTorch implementation and one for the TF implementation), then adding an option to BertConfig and DistilBertConfig (and maybe other model config classes) allowing the user to select FAVOR+ as the attention implementation?

It just seems sort of silly and wasteful to create multiple entirely new models for this when FAVOR+ has backwards compatibility.

Also since FAVOR+ is an unbiased estimator of full softmax attention, it should be possible to have an option that would tell the model to dynamically switch between FAVOR+ and full attention at test time depending on the sequence length. This would be desirable since FAVOR+ is slower than softmax attention when the sequence is shorter than O(d*log(d)), where d is the number of dimensions per attention head. Implementing such dynamic switching would be easier and more elegant if FAVOR+ is just a config option and not a new model class.
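
Concretely, the switch could be as simple as comparing the sequence length against that threshold (a rough sketch with assumed names, not a settled API):

```python
import math

def attention_forward(q, k, v, favor_attention, softmax_attention, switch_factor=1.0):
    """Pick FAVOR+ or full softmax attention based on sequence length.

    q, k, v: [batch, heads, length, head_dim]. `favor_attention` and
    `softmax_attention` are callables with the same signature; `switch_factor`
    scales the d*log(d) crossover point and would need to be tuned empirically.
    """
    length, head_dim = q.shape[-2], q.shape[-1]
    threshold = switch_factor * head_dim * math.log(head_dim)
    if length < threshold:
        return softmax_attention(q, k, v)   # cheaper for short sequences
    return favor_attention(q, k, v)         # linear in length for long sequences
```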

marcoabrate commented 3 years ago

Any update on the implementation of this new architecture? @norabelrose

norabelrose commented 3 years ago

@marcoabrate The initial implementation is complete at https://github.com/norabelrose/transformers-plus-performers/blob/performers/src/transformers/modeling_performer_attention.py. Haven't been able to test it yet because getting my hands on the right datasets for finetuning DistilBERT with Performer attention, preprocessing the data, etc. has proven to be a huge ordeal. Should hopefully be able to do it today though.

norabelrose commented 3 years ago

UPDATE: The most recent commit on my transformers-plus-performers repo is now up and running. Right now I only changed DistilBertModel and DistilBertConfig to enable them to use Performer attention (just set attention_type='performer'), but it should be quite trivial to add the feature to other models.
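
Concretely, usage looks roughly like this (a sketch against my fork; anything other than attention_type may differ):

```python
# Assumes the norabelrose/transformers-plus-performers fork is installed,
# not upstream transformers.
from transformers import DistilBertConfig, DistilBertModel

config = DistilBertConfig.from_pretrained("distilbert-base-uncased")
config.attention_type = "performer"   # switch from softmax attention to FAVOR+

model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=config)
# The softmax checkpoint loads as-is; some fine-tuning is still needed because
# FAVOR+ only approximates the original attention.
```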

As I type this I'm fine-tuning the distilbert-base-uncased pretrained model to work with Performer attention by distilling it against bert-base-uncased. You should be able to just directly fine-tune it with MLM but I figured that distillation might get you better results. It seems to be converging rather quickly but I haven't been running it for long and I only have one GPU to work with.

I would welcome other people taking a look at my repo and submitting pull requests to it.

djstrong commented 3 years ago

FAVOR+ is slower than softmax attention when the sequence is shorter than O(d*log(d)), where d is the number of dimensions per attention head

What are those numbers for DistilBERT, BERT-base and BERT-large?

Did you compare real speed?

norabelrose commented 3 years ago

I haven't had a chance to compare the difference on actual models yet, but I should be able to do that in the next day or two.

I have, however, tested the speed difference between softmax attention and FAVOR+ on random Gaussian matrices. FAVOR+ really starts to get faster when the sequence length is ~18 times larger than d*ln(d), at least on my GPU. With BERT settings (d_model = 768, num_heads = 12) that means about 5000 tokens.

[benchmark screenshots]

This is basically because you have to matrix-multiply Q and K by the random feature matrix, which you don't have to do for softmax attention. You get better results with Performer when (d_model / num_heads) is smaller:

[benchmark screenshot]

I should mention that while FAVOR+ might be slower than softmax for some of these "medium" sequence lengths, it should still be using less memory than softmax, since it isn't allocating that L x L attention matrix. So there's somewhat of a memory-time tradeoff here.

The numbers I show above are from my own implementation of FAVOR+, but I also tried it with the performer_pytorch implementation and got almost identical results. Really, FAVOR+ is an attention mechanism for long sequences. It's got this great unique property that it's an unbiased estimator of softmax attention. That means that you can easily use it with models that were pretrained on softmax attention, and you can switch between FAVOR+ and softmax at inference time. And that's why it should be part of Huggingface.
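
For anyone who wants to play with the numbers themselves, here is a minimal sketch of noncausal FAVOR+ with the positive softmax-kernel features (plain Gaussian random features rather than the orthogonal block from the paper, no numerical stabilizer, and my own function names rather than the fork's):

```python
import math
import torch

def softmax_kernel_features(x, proj):
    # x: [batch, heads, length, head_dim], proj: [num_features, head_dim].
    # phi(x) = exp(w.x_hat - ||x_hat||^2 / 2) / sqrt(m), with x_hat = x / d**0.25,
    # so that E[phi(q).phi(k)] = exp(q.k / sqrt(d)), i.e. the softmax kernel.
    d, m = x.shape[-1], proj.shape[0]
    x_hat = x / d ** 0.25
    wx = x_hat @ proj.T                                   # [..., length, num_features]
    sq_norm = (x_hat ** 2).sum(dim=-1, keepdim=True) / 2
    return torch.exp(wx - sq_norm) / math.sqrt(m)

def favor_noncausal_attention(q, k, v, num_features=256):
    # Linear-complexity attention: no L x L matrix is ever materialized.
    d = q.shape[-1]
    proj = torch.randn(num_features, d, device=q.device, dtype=q.dtype)
    q_prime = softmax_kernel_features(q, proj)            # [B, H, L, m]
    k_prime = softmax_kernel_features(k, proj)            # [B, H, L, m]
    kv = torch.einsum("bhlm,bhld->bhmd", k_prime, v)      # [B, H, m, d]
    z = 1.0 / (q_prime @ k_prime.sum(dim=2).unsqueeze(-1))  # normalizer, [B, H, L, 1]
    return torch.einsum("bhlm,bhmd->bhld", q_prime, kv) * z
```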

norabelrose commented 3 years ago

UPDATE: While I have Performer up and running with DistilBertModel, I've run into a problem that I didn't even think about when I started. DistilBERT, BERT, RoBERTa, and several other models use learned positional embeddings, which impose a fixed 512-token max sequence length. In order to process sequences longer than 512 tokens, and thereby get the benefits of Performer attention, we'll need to use some other type of positional embeddings; for maximum flexibility, probably fixed sinusoidal embeddings with some large max sequence length. We could also try using relative position embeddings, although AFAIK no one has actually tried doing that with Performer attention and I would need to think about it a bit to figure out if that's actually feasible. DistilBertModel actually already comes with a sinusoidal_pos_embds option, but this option is overridden when you load the weights from a pretrained model.

It's not clear how hard it would be to finetune a pretrained model that was trained with learned positional embeddings to use fixed sinusoidal ones, or if it would even be worth it— it may be necessary to just train them from scratch, especially since we are also trying to swap out the attention mechanism. I'll try finetuning soon and see what happens. But it's looking more likely that we won't be able to just plug in the already existing checkpoints like we initially hoped. If that turns out to be the case, it would be really great if someone with access to more than one GPU could do the training from scratch and upload the models :)

PS: After @djstrong's comment about FAVOR+'s performance on relatively short sequences, I wanted to get to the bottom of why FAVOR+ was so much slower until you get up to around 5000 tokens. Oddly enough, it turns out that the torch.max() operation which is used to generate the numerical stabilizer for the exp() kernel was the main culprit. When you don't use a stabilizer, Performer attention starts beating softmax attention at much shorter sequence lengths. So I added an option in PerformerAttentionConfig to turn off the stabilizer.
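
For concreteness, the stabilizer in question is the usual subtract-a-max-inside-the-exp trick applied to the feature-map logits. A simplified sketch of the toggle (a single global max here, whereas the real implementations use finer-grained maxes; names are mine, not the fork's):

```python
import torch

def softmax_kernel_features_stabilized(x, proj, use_stabilizer=True):
    d = x.shape[-1]
    x_hat = x / d ** 0.25
    logits = x_hat @ proj.T - (x_hat ** 2).sum(dim=-1, keepdim=True) / 2
    if use_stabilizer:
        # Subtracting a global max keeps exp() from overflowing; the constant
        # cancels between the attention numerator and its normalizer, but the
        # max reduction itself is what dominated runtime at short lengths.
        logits = logits - logits.max().detach()
    return torch.exp(logits) / proj.shape[0] ** 0.5
```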

guotong1988 commented 3 years ago

https://github.com/huggingface/transformers/issues/8893

TensorFlow code, not JAX. Thank you.

norabelrose commented 3 years ago

@guotong1988 as of about half an hour ago, my fork now has a TensorFlow implementation: https://github.com/norabelrose/transformers-plus-performers/blob/performers/src/transformers/modeling_tf_performer_attention.py.

I have not had a chance to test it at all. If someone else could at least try getting it working on their own system that would be great. Pull requests are welcome.

tomweingarten commented 3 years ago

Hey @norabelrose , I'm part of the Performer team at Google, it's great to see this getting added to huggingface! Would you be open to meeting so we can discuss how we can work together on this? If anyone else is interested in joining the meeting please comment here and I'll reach out to coordinate.

norabelrose commented 3 years ago

@tomweingarten Sure! Send me an email at belrose.nora@gmail.com and we can set up a time to talk in the next couple weeks. As I mentioned above, the basic implementation in PyTorch and TensorFlow is done but we need to write unit tests and make sure everything is working properly.

Also, in my fork at transformers-plus-performers I had to make a few minor changes to other parts of HuggingFace in order to get training to run smoothly on my machine, in particular the distillation example program, since I initially tested PerformerAttention by continuing distillation of a pretrained DistilBERT model with Performer attention against bert-base. The implementation of distillation on master loads all the training data into RAM at once, which blows up on my meager hardware. I changed it so that you can load the training data incrementally. That's probably a good thing to add to the master branch, but arguably it should be put in a separate pull request. So we'll have to change that, along with a couple of other little things.

I'd recommend you check out my fork at https://github.com/norabelrose/transformers-plus-performers/. The relevant files are /src/transformers/configuration_performer_attention.py, /src/transformers/modeling_performer_attention.py, and /src/transformers/modeling_tf_performer_attention.py. I also changed the BERT and DistilBERT model and config files so the user can use Performer attention with them. I'll accept pull requests on that repo.

PS: Also just realizing that short_sequence_behavior on PerformerAttentionConfig in the last commit is annotated variously as Union[str, dict], Union[str, Callable], or Union[str, tuple]. Sorry about that; I wasn't really sure how best to implement that feature. Right now the actual implementation in PerformerAttention assumes it's a str or Callable.

marcoabrate commented 3 years ago

@tomweingarten @norabelrose I would like to participate in the meeting too, if possible. I am working with long sequences for summarization. I have not had the chance to go through the code thoroughly yet, but I am ready to help soon.

Edit: you can reach me at marco.abrate@epfl.ch

TwinMooon commented 3 years ago

@norabelrose Is there any plan to support unidirectional attention?

xingyousong commented 3 years ago

Hi guys, thanks to @kchoro and @ValeryTyumen on the Performers team, we've open-sourced the Tensorflow version of FAVOR+ here: https://github.com/google-research/google-research/tree/master/performer/fast_attention/tensorflow

BTW, we've edited the folder name and code to be fast_attention now rather than fast_self_attention.

Please let us know how well it works in your pipelines!

norabelrose commented 3 years ago

UPDATE: The new default branch ("clean") on my fork at https://github.com/norabelrose/transformers-plus-performers/ now has all the extraneous changes I made to the upstream removed. I also merged in all new commits from upstream.

@TwinMooon Yes, we should be able to add causal attention. I was under the impression that it would be necessary to include a custom CUDA kernel from the fast-transformers library to compute the prefix sums— since that's what the performer_pytorch implementation does, which I used as a template for my implementation— but now looking at the Google code in both Jax and TensorFlow I realize that they just compute the prefix sums in Python code and then use a custom gradient. So it looks like it's not necessary, although it's possible that using the CUDA kernel gives you a noticeable speed boost.
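
For reference, the pure-Python prefix-sum version looks roughly like this (a naive sketch that materializes an L x m x d tensor of outer products, which is exactly why a fused kernel helps):

```python
import torch

def favor_causal_attention(q_prime, k_prime, v):
    # q_prime, k_prime: [B, H, L, m] kernel features; v: [B, H, L, d].
    # Position i may only attend to positions j <= i, so accumulate running
    # sums of k'_j v_j^T and k'_j along the length dimension.
    kv = torch.einsum("bhlm,bhld->bhlmd", k_prime, v)   # per-position outer products
    kv_cumsum = kv.cumsum(dim=2)                        # prefix sums over length
    k_cumsum = k_prime.cumsum(dim=2)                    # [B, H, L, m]
    numerator = torch.einsum("bhlm,bhlmd->bhld", q_prime, kv_cumsum)
    denominator = (q_prime * k_cumsum).sum(dim=-1, keepdim=True)
    return numerator / denominator
```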

norabelrose commented 3 years ago

I'd like to set a goal of making an official pull request to add this to master by the end of the year. I haven't been able to do that yet because I've been busy with school and other projects, and I haven't gotten any help from other contributors. Key things that need to be done are:

As always, any and all help with these tasks is welcome.

norabelrose commented 3 years ago

@TwinMooon Update: I got causal attention working by translating the Google implementation, but as I feared, it's very slow since it doesn't use a custom CUDA kernel. On my GPU, it's 19-20 times slower than noncausal attention. But there might be a way around this; I'll have to think about it.

In the meantime, I think I'm going to add an optional dependency on the fast_transformers package (just wrapping the import statement in a try... except block) to get access to their custom CUDA kernel. I'll include a warning that causal attention might have poor performance if the user doesn't have the package installed. That's what the performer_pytorch package does.
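
The optional-dependency wrapper would look something like this (the import path for the CUDA kernel below is a guess on my part; treat it as a placeholder for whatever fast-transformers actually exposes):

```python
import warnings

try:
    # Hypothetical import path; check the fast-transformers docs for the real one.
    from fast_transformers.causal_product import causal_dot_product
    HAS_FAST_TRANSFORMERS = True
except ImportError:
    causal_dot_product = None
    HAS_FAST_TRANSFORMERS = False
    warnings.warn(
        "fast_transformers is not installed; causal Performer attention will "
        "fall back to a pure PyTorch prefix-sum implementation, which may be slow."
    )
```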

TwinMooon commented 3 years ago

@norabelrose Over the past two days, I have implemented a version of causal attention by just translating Google's TensorFlow implementation. After reading your code, I found that our implementations are quite similar. However, the causal version runs a little faster than the noncausal version on my machine. My PyTorch version is 1.5.0, running on a 2080 Ti with CUDA 10.0.

norabelrose commented 3 years ago

@TwinMooon Ok cool! If you wouldn't mind submitting a pull request to my fork or just copying and pasting the relevant block of code here, I could check to see if your version is faster. It's possible that I'm making some silly mistake.

I'm running it on a GeForce GTX 1080 with PyTorch 1.4.0 and CUDA 10.0.0. It was also noticeably slower than noncausal attention on my CPU-only laptop, which has PyTorch 1.7.

PS: Is it possible that you got the tensor shapes mixed up? The Google implementation expects tensors of shape [length, batch, heads, random features/embedding dim] while everywhere else it's usually [batch, heads, length, random features/embedding dim], so you have to permute the tensor dimensions. The code will actually run if you give it tensors with the [B, H, L, D] shape though, so I got tripped up on that when I first translated the Google code and it made it look like it was faster than it actually was. If you're using a small batch size of say, 1 or 5, it'll be a lot faster to compute prefix sums over the batch dimension than doing it over the sequence length dimension of size 512 (which is what it's actually supposed to do).
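
In code, the reshuffle between the two conventions is just a permute (a quick illustration, not taken from either implementation):

```python
import torch

x_lbhd = torch.randn(512, 8, 12, 64)      # Google layout: [length, batch, heads, dim]
x_bhld = x_lbhd.permute(1, 2, 0, 3)       # -> [batch, heads, length, dim]
assert x_bhld.shape == (8, 12, 512, 64)

# ...and back again before calling code that expects the Google layout.
x_lbhd_again = x_bhld.permute(2, 0, 1, 3)
```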

TwinMooon commented 3 years ago

@norabelrose You can review my implementation here. I permuted the tensor shape before feeding it into the causal attention.

norabelrose commented 3 years ago

@TwinMooon In your code, you spell the word "causal" two different ways: "causal" and "casual". You use the "causal" spelling in the forward() method where short_sequence_behavior indicates to use softmax attention, and then you use casual everywhere else.

Is it possible that you're initializing the PerformerAttention object sort of like this: PerformerAttention(PerformerAttentionConfig(d_model=768, num_heads=12), causal=True) so that the "casual" attribute remains its default value of False, and none of the causal attention code ever actually gets called? I should probably change __init__ so that it always throws an error when you include a nonexistent attribute in kwargs.
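
Something like this would catch the typo immediately (illustrative, not the fork's actual code; it assumes kwargs are meant to override config attributes, which is how the example above uses them):

```python
import torch

class PerformerAttention(torch.nn.Module):
    def __init__(self, config, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            if not hasattr(config, key):
                # Fail loudly on typos like `casual=True` instead of silently ignoring them.
                raise AttributeError(f"PerformerAttentionConfig has no attribute '{key}'")
            setattr(config, key, value)
        self.config = config
        # ...rest of the real module setup...
```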

In other news, I figured out a sort of clever way of making causal attention like 2x faster, and that's in my latest commit.

norabelrose commented 3 years ago

Mark Zakharov made a Colab where he successfully finetuned a DistilBERT model with the most recent version of my fork, which you can check out here: https://colab.research.google.com/drive/1BUYk4qxdt1b3d5mx6_t0nnX5jP9KwVAv?usp=sharing

I think the project is almost ready to be made into a formal pull request.

TwinMooon commented 3 years ago

@norabelrose cool! I'll try it now.

patrickvonplaten commented 3 years ago

This is really great work guys! We are currently running some experiments on the Flax version of Performer internally and looking into how to best integrate the model into Transformers. @norabelrose a PR in PyTorch and/or TensorFlow would be amazing!

tomweingarten commented 3 years ago

Excited to see the progress here! Just wanted to give a heads-up that we fixed a significant bug in our TF implementation of Performer fast attention.

norabelrose commented 3 years ago

Pull request finally submitted: #9325

benathi commented 3 years ago

This is great! Thank you for your hard work! :) I was wondering if it would be trivial to extend this to support encoder-decoder models such as BART or T5? Does the method `init_performer_attention` currently work for cross-attention?

norabelrose commented 3 years ago

@benathi It should be quite simple. You'll just need to read through the implementations of BART and T5 and 1) find what name they are using for their query, key, value, and output linear layers so that PerformerAttention can mimic the naming convention and 2) find the immediate parent module of the attention module so you can put @init_performer_attention() on its __init__ method with the appropriate parameters. Sometimes models will roll, for example, LayerNorm into the attention module which means a little bit of refactoring might be needed in order for PerformerAttention to be dropped in as a replacement. The inconsistency in implementation across models is the only thing that prevents this from being 100% trivial.

kchoro commented 3 years ago

Hi Guys,

Happy New Year ! Thank you for your great work ! I wonder whether it would make sense to meet soon to discuss where we are in terms of integration, etc. :)

P.S

One quick observation on my side. In our experiments we found settings where Performer's approximate softmax was the best, but also applications where Performer-ReLU (which does not use random features) was outperforming other Performer variants. Performers enable those different attention variants simply via different functions for creating kernel features (we will actually be adding some more kernel feature makers to the open-sourced version very soon). We think of both Performer approximate softmax and Performer-ReLU (both already open-sourced) as good defaults, and which of them is chosen should probably be determined by experiment in the particular setting under consideration. Also, it would ultimately be exciting if one could find new kernel feature functions that outperform them. Performers are flexible and can be applied even with future variants of kernel feature makers that we are not aware of right now :) So I think that modularizing the code so that one can easily plug in her/his own kernel feature maker (while still having good default variants available) would be very attractive for new users and would encourage people to further develop the codebase.

Best,

Krzysztof


norabelrose commented 3 years ago

As per the suggestion by @kchoro , I just added the ability to pass in custom Callables to the kernel_type config parameter.
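
For example, a Performer-ReLU style feature map can now be passed in directly. The callable signature below (projected queries/keys in, kernel features out) and the import paths are my assumptions based on the fork's file layout, so double-check against the code before copying:

```python
import torch

# Import paths follow the fork's file layout described above; adjust if they differ.
from transformers.configuration_performer_attention import PerformerAttentionConfig
from transformers.modeling_performer_attention import PerformerAttention

def relu_kernel_features(x):
    # Performer-ReLU style features: deterministic, no random projection needed.
    return torch.nn.functional.relu(x) + 1e-6   # epsilon keeps the normalizer positive

config = PerformerAttentionConfig(d_model=768, num_heads=12, kernel_type=relu_kernel_features)
attention = PerformerAttention(config)
```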

gaceladri commented 3 years ago

Hello,

Amazing work @norabelrose! I have been trying your Performer implementation. I copied your PerformerAttention implementation and replaced the normal self-attention in MobileBERT with it. I have tracked some metrics with respect to other implementations. I have seen that at a sequence length of 512 tokens it consumes the same memory as normal self-attention, and it is just as fast.

I have logged the metrics with Wandb: https://wandb.ai/gaceladri/new_berts/reports/Memory-and-speed-comparison--Vmlldzo0NDA4MTI

Does that make sense? I have seen in Long Range Arena https://arxiv.org/abs/2011.04006 that it is 1.2x faster at 1k tokens, but I have not tried sequences that long. The point where I am confused is the memory consumption. At shorter lengths, shouldn't the attention mechanism, being linear with respect to sequence length, consume less memory?

tomweingarten commented 3 years ago

Hi @gaceladri , can you tell us what your hyperparameters are set to for the model dimensions and number of random features? Those will both affect the scale of memory and computation. At short sequence lengths (512) you may not see any benefit in memory or speed. There's more detail on the computational complexity and how it depends on these hyperparameters in the paper.

gaceladri commented 3 years ago

@tomweingarten

hidden_size = 128, layers=8, intermediate_size=128, embedding_size=128, max_position_embeddings=512.

I have looked at the paper. You are right: the paper reports that at short sequence lengths the timing should not be better. In the Long Range Arena they start from 1000 tokens onwards, and it is 1.2x faster than normal attention.

Thanks a lot for the clarification!

tomweingarten commented 3 years ago

"It is easy to see that such a mechanism is characterized by space complexity O(Lr + Ld + rd) and time complexity O(Lrd) as opposed to O(L^2 + Ld) and O(L^2 d) of the regular attention (see also Fig. 1)."

At that size I would expect the O(Lr) term of the Performer space complexity to dominate, which is comparable to L^2 assuming your number of features is set to 256. Since your feedforward dimensionality is so small, the other factors will largely drop out except for constants. So your result looks pretty normal, but let us know if you see unexpectedly large memory usage when scaling it bigger along the sequence dimension!
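
Plugging in your numbers makes this concrete (assuming r = 256 random features):

```python
L, d, r = 512, 128, 256   # sequence length, model width, random features (assumed)

performer_space = L * r + L * d + r * d   # O(Lr + Ld + rd)
softmax_space = L ** 2 + L * d            # O(L^2 + Ld)

print(performer_space, softmax_space)     # 229376 vs. 327680 -- same order of magnitude
```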

ksrinivs64 commented 3 years ago

@norabelrose, thanks for the very nice work! There seems to be a merge conflict in init of transformers now though.

Neo9061 commented 3 years ago

[quoting @norabelrose's earlier update on positional embeddings and the exp-kernel stabilizer, above]

Hi @norabelrose and @tomweingarten, just wondering, based on your experiments: to use the model (BERT + Performer attention) on long sequences of text, do we need to pre-train a BERT + Performer attention model from scratch, given that the position embeddings are learned and there are only 512 of them in a pretrained bert-base? Or are there any tricks to load a pretrained bert-base and directly insert Performer attention during fine-tuning, for example, switching the learned position embeddings to sinusoidal ones and discarding the pretrained position-embedding weights from bert-base?

norabelrose commented 3 years ago

@Neo9061 Sorry for taking a while to respond.

I never actually tried this, but based on this documentation from DeepSpeed it sounds like the best way to finetune a pretrained model with learned positional encodings on sequences longer than it was trained on is to simply duplicate the pretrained encodings N times: https://www.deepspeed.ai/tutorials/sparse-attention/. So I would try that before switching over to fixed sinusoidal embeddings or anything else.
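
A rough sketch of that duplication trick for a BERT-style model (untested; it glosses over details like the registered position_ids buffer, which also needs to be resized):

```python
import torch
from torch import nn

def extend_position_embeddings(model, factor=4):
    """Tile BERT's learned 512-position embeddings `factor` times, in the spirit of
    the DeepSpeed sparse-attention tutorial, so the model accepts longer inputs."""
    old = model.embeddings.position_embeddings             # nn.Embedding(512, hidden)
    new = nn.Embedding(old.num_embeddings * factor, old.embedding_dim)
    with torch.no_grad():
        new.weight.copy_(old.weight.repeat(factor, 1))     # duplicate the pretrained rows
    model.embeddings.position_embeddings = new
    model.config.max_position_embeddings = new.num_embeddings
    return model
```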

That said, I recommend against using Performer attention in general and especially the implementation of it in this fork, since it isn't maintained. Imo better solutions for long sequences would be the Longformer or BigBird implementations already merged into master in this library, which can go up to 4096 tokens, or using the DeepSpeed library & utilities to retrofit Sparse Attention onto pretrained models from transformers.

tomweingarten commented 3 years ago

@Neo9061 The DeepSpeed approach sounds reasonable to me, though I haven't tried it myself. If you're able, I'd recommend doing a small pre-training round whenever you "uptrain" from one model to another -- in this case that could allow you to re-learn the position encoding and also adjust the attention weights to move from softmax to the Performer softmax approximation.

kchoro commented 3 years ago

Hi Guys,

Regarding relative positional encoding with Performers, this can be done in several different ways now and there are lots of papers published recently demonstrating this, for example:

https://arxiv.org/abs/2105.08399

It is a very simple trick that in practice works very well. If you want to finetune with the Performer variant a model pretrained with learned positional encodings, another option would be to freeze your pretrained positional encoding in finetuning stage and concatenate with other features. This can be done for instance by doing SVD of the learned positional embedding mask:

QK^T + M = QK^T + AB^T = [Q|A][K|B]^T (and you apply FAVOR+ to the last expression)
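
In PyTorch the factorization step would look roughly like this (a sketch only: M is the frozen L x L additive bias on the attention logits, the rank is a free choice, and q / k are per-head projections of shape [batch, heads, L, d]):

```python
import torch

def augment_with_positional_bias(q, k, M, rank=32):
    """Absorb an additive logit bias M into the queries/keys via a low-rank
    factorization, so FAVOR+ can be applied to the concatenated features:
    q k^T + M ~ q k^T + A B^T = [q|A][k|B]^T."""
    U, S, Vh = torch.linalg.svd(M)                       # M: [L, L]
    A = U[:, :rank] * S[:rank].sqrt()                    # [L, rank]
    B = Vh[:rank].T * S[:rank].sqrt()                    # [L, rank]
    batch, heads = q.shape[0], q.shape[1]
    A = A.expand(batch, heads, -1, -1)                   # broadcast over batch and heads
    B = B.expand(batch, heads, -1, -1)
    return torch.cat([q, A], dim=-1), torch.cat([k, B], dim=-1)
```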

Best,

Krzysztof


ideasbyjin commented 2 years ago

I'm really excited by this potential addition! What is the timeline on integration into HF?

patrickvonplaten commented 2 years ago

A working checkpoint with Performer would really help ;-)

ideasbyjin commented 2 years ago

Thanks! I thought the idea behind the Performer was that it's more about a methodology / attention technique than about a specific pre-trained model, right? (Or at least, that's what I gathered from the paper.)