huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add MEGA #19982

Closed · mnaylor5 closed this issue 1 year ago

mnaylor5 commented 1 year ago

Model description

MEGA introduces a new attention method that incorporates gating and exponential moving averages to create strong local dependencies, reducing the need for full softmax attention. MEGA set a new SOTA on Long Range Arena, and MEGA-chunk performs nearly as well while achieving linear complexity with respect to sequence length. I have seen really promising results from my own experiments with MEGA on long documents -- both in efficiency and model performance. It would be awesome to have MEGA (+ MEGA-chunk) available in the Hugging Face ecosystem!
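For intuition, the EMA component boils down to a damped moving-average recurrence applied along the sequence before the gated attention. Below is a minimal single-dimensional sketch; the actual model learns a multidimensional version and evaluates it efficiently as a convolution, so this toy loop is purely illustrative:

```python
import torch

def damped_ema(x: torch.Tensor, alpha: float, delta: float) -> torch.Tensor:
    """Scalar damped EMA over a sequence:
        h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}

    x has shape (batch, seq_len). Mega learns a multidimensional version of this
    recurrence; this loop only shows the smoothing behaviour.
    """
    h = x.new_zeros(x.shape[0])
    outputs = []
    for t in range(x.shape[1]):
        h = alpha * x[:, t] + (1.0 - alpha * delta) * h
        outputs.append(h)
    return torch.stack(outputs, dim=1)  # (batch, seq_len)

# Larger alpha tracks the input more closely; smaller alpha averages over a longer window.
signal = torch.randn(2, 16).cumsum(dim=-1)
smoothed = damped_ema(signal, alpha=0.3, delta=0.9)
```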

Open source status

Provide useful links for the implementation

NielsRogge commented 1 year ago

> I have seen really promising results from my own experiments with MEGA on long documents

Cool! Could you elaborate?

It would be very useful indeed, especially if there were pre-trained weights for longer-sequence tasks, like summarization of very long texts or classifying multiple images (this has been requested a lot for LayoutLM-like models, which only operate on single document images).

However, I'm not seeing very useful pre-trained weights at the moment; it would be great to have a BERT-like, Wav2Vec2-like, or ViT-like checkpoint for operating on long sequences.

mnaylor5 commented 1 year ago

Thanks for the quick response @NielsRogge!

I have only experimented with MEGA in a long-document classification setting so far, and I trained the full architecture from scratch without using the pre-trained weights. I used the authors' implementation and set up a BERT-style document classification class, using architectural details similar to those of the Text task in LRA (Appendix D), but with encoder_chunk_size=16.
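For context, the wrapper was roughly the shape of the sketch below. The `encoder` argument stands in for the stack of Mega encoder blocks from the authors' repo, and the class name, signatures, and pooling choice here are illustrative assumptions rather than my actual code:

```python
import torch
import torch.nn as nn

class MegaDocumentClassifier(nn.Module):
    """Embed -> sequence encoder -> masked mean-pool -> linear head.

    `encoder` is any module mapping (batch, seq_len, hidden) -> (batch, seq_len, hidden);
    in the experiment described above it would be a stack of Mega encoder blocks
    (with chunked attention, e.g. chunk size 16).
    """

    def __init__(self, encoder: nn.Module, vocab_size: int, hidden_size: int, num_labels: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(input_ids))
        # Mean-pool over non-padding positions before the classification head.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # (batch, num_labels) logits
```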

For performance details in my initial experiment: I used roughly 7k documents in training and 2k in validation, with up to ~3k tokens in a document. Using a single T4 GPU, each epoch (train + eval) averaged ~22 seconds. This is quite a bit faster than I've seen with other linear-complexity attention mechanisms, and I suspect it's largely due to the significant decrease in model size (4-6 layers with a single attention head in each). It's hard to compare model performance since I trained fully from scratch, but MEGA certainly seemed to reach competitive performance for my task.

I agree that the currently available model weights aren't the most generally useful, and that a BERT-like encoder would be great. I'm not sure if the authors intend to release something like that, but if not, hopefully the speed gains reduce the barrier for community LM contributions.

MarkRich commented 1 year ago

Hey @mnaylor5! Apologies if this is implied, but are you working on contributing this model or just indicating it would be great to have? I'd be happy to help implement in the transformers repo if the latter (or in either case if you'd be interested!). I have some 3090s to throw at this, though perhaps this isn't enough compute?

In any case, excited to see if I can help & to see this get added to HF!

mnaylor5 commented 1 year ago

Hi @MarkRich - no worries, I definitely could have been clearer. At this point, I am mainly just saying that it would be great to have available in the Hugging Face ecosystem. I'd love to contribute, but I doubt I can realistically commit the time over the next few weeks at least. I put up the issue in case anyone from the HF team or community got excited about implementing it 😄

MarkRich commented 1 year ago

Sweet, I can take a crack at it. @NielsRogge is there any chance I can get added to a slack channel or something similar so that I can ask questions? My email address is mark.rich388@gmail.com

NielsRogge commented 1 year ago

Sure, I'll create a channel and send you an invite.

lingjzhu commented 1 year ago

My research involves the MEGA model. Is there any way that I can contribute to this? Happy to make it available on HuggingFace!

NielsRogge commented 1 year ago

Hi,

That'd be great. Could you provide your email address? I'll add you to the Slack channel.

lingjzhu commented 1 year ago

Thank you! My email is lingjzhu at umich.edu

lingjzhu commented 1 year ago

@NielsRogge Hi, this is a gentle follow-up about adding MEGA. Could I start to work on it now?

lingjzhu commented 1 year ago

@NielsRogge Nevermind. I have joined. Thank you!

mnaylor5 commented 1 year ago

Hi there! I was able to set aside some time to pretrain a very basic Mega model using BERT-style masked language modeling. I know this was something that @NielsRogge mentioned as being more useful, so I hope these pretrained weights will be helpful for getting Mega into transformers!

I used the official Mega implementation (specifically the MegaEncoderLayer class) and pretrained on wikitext-103 - nothing earth-shattering, but hopefully helpful 😄 The model specs and the code I used for training are in this Colab notebook, along with code for loading the classes and weights; the weights and tokenizer are saved in this repo on the HF model hub.
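For anyone who doesn't want to open the notebook: the data side is just the standard Hugging Face MLM recipe with a RoBERTa tokenizer over wikitext-103, roughly like the sketch below (the hyperparameters are placeholders, not necessarily the exact values in the notebook):

```python
from datasets import load_dataset
from transformers import RobertaTokenizerFast, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    # max_length here is illustrative; long-sequence pretraining may use a larger value.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens on the fly, BERT-style.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```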

mnaylor5 commented 1 year ago

Hi there @lingjzhu @MarkRich @NielsRogge - any update on how this is going? I've been using the Mega architecture (from the original implementation) more in my own experiments, and I am super excited about using it more within the HF ecosystem.

I might have some time to help with the implementation of Mega into Transformers over the next few weeks, so I would be happy to contribute to any ongoing efforts or take a stab at contributing it myself.

lingjzhu commented 1 year ago

> Hi there @lingjzhu @MarkRich @NielsRogge - any update on how this is going? I've been using the Mega architecture (from the original implementation) more in my own experiments, and I am super excited about using it more within the HF ecosystem.
>
> I might have some time to help with the implementation of Mega into Transformers over the next few weeks, so I would be happy to contribute to any ongoing efforts or take a stab at contributing it myself.

@mnaylor5 That would be nice. I have been working on the text version and have an initial WIP codebase. However, due to some life events, I haven't completed it yet. I will upload it to my GitHub this weekend, and maybe we can work together to complete it.

mnaylor5 commented 1 year ago

@lingjzhu cool, no worries! I'll get started and look forward to checking out your code 😄

mnaylor5 commented 1 year ago

@NielsRogge - apologies if there's a better place to ask this, or if I'm missing some documentation that explains this. The Mega paper includes experiments on encoder-only tasks (text and image classification) as well as seq2seq (machine translation, language modeling with encoder-decoder). Is there a preference from the HF team on how to structure these separate approaches? My own work with Mega has been within encoder-only settings (pre-training with masked LM and fine-tuning on sequence or token classification), so I'm inclined to start by implementing it similarly to BERT, but I wasn't sure if this would be a problem.

lingjzhu commented 1 year ago

@mnaylor5 My WIP code is here. The code is in the src/transformers/models/src, but it does not run yet.

I started by copying the code for the T5 model and using Mega as a drop-in replacement for the attention module. So far, I have moved all Mega-related code from the official repo into modeling_mega.py and am now fusing it with the pretrained_model class. Given that T5 has both an encoder and a decoder, it would be great to implement them all in one. I think most of the existing code can be reused. Maybe we could coordinate and finish the rest of the work?

Once the implementation is ready, I can pretrain an encoder, a decoder, and an encoder-decoder model on a medium-sized dataset and push them to the hub.

mnaylor5 commented 1 year ago

Thanks @lingjzhu! I ended up doing a similar pure-PyTorch reimplementation of the original Mega code - after doing that and reading through the Hugging Face documentation, I think I have a solid understanding of how to proceed. Even though a large part of the Mega architecture is the EMA-based attention, it probably makes sense to implement the full Mega blocks that the authors propose (including the normalized feed-forward layer) rather than dropping the EMA portion into another architecture like T5. This keeps the implementation in line with what the Mega paper introduces, and using T5 as a base would also make it harder to work in encoder-only settings like document classification.

With this in mind, and in response to my own question above, I think it makes the most sense to approach the Mega implementation similarly to BigBird, which offers a conceptually similar improvement: a more efficient alternative to standard self-attention that can be used in encoder-only, decoder-only, and seq2seq settings. The BigBird implementation follows the approach of BERT, which sets things up so that BigBirdModel can be used as either an encoder or a decoder based on the provided config. If my understanding is correct, the extension to seq2seq is then handled by Hugging Face's EncoderDecoderModel class.
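As a concrete illustration of that pattern (using BERT as a stand-in, since Mega isn't in the library yet), two encoder-only checkpoints can be composed into a seq2seq model like this:

```python
from transformers import EncoderDecoderModel, BertTokenizerFast

# Compose two encoder-only checkpoints into a seq2seq model; the decoder side is
# automatically reconfigured to use causal masking and cross-attention.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Generation needs these set explicitly for BERT-style checkpoints.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```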

I have gotten started by using the add-new-model-like command and starting from RoBERTa (since I used a RoBERTa tokenizer in the MLM pretraining in my earlier comment), and I'm working through the implementation now.

One question for @NielsRogge / the Hugging Face team: the original implementation of Mega does not include token type embeddings - it does not preclude their usage, but the authors' tasks did not use them. I'm afraid that tasks like QA would be difficult to implement without these embeddings, but including them would introduce a divergence from the model checkpoints currently available from the original repo (including the ones I linked above from the BERT-style encoder). Do you have a recommended way of approaching this?

NielsRogge commented 1 year ago

Hi,

Some models, like DistilBERT, also don't support token_type_ids, and they work just fine (thanks to the SEP token). But feel free to add support for token type ids - it can't hurt to use them :)
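For example, encoding a sentence pair works either way: BERT's tokenizer returns token_type_ids while DistilBERT's doesn't, and both still separate the segments with special tokens.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
distil_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

pair = ("What does Mega replace?", "Softmax attention over the full sequence.")
print(bert_tok(*pair).keys())    # includes 'token_type_ids'
print(distil_tok(*pair).keys())  # no 'token_type_ids'; segments are only split by [SEP]
```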

mnaylor5 commented 1 year ago

@NielsRogge thanks for the quick response. That makes sense, and I'll add support for them 😄

Tylersuard commented 1 year ago

@mnaylor5 You are a saint for posting that Colab! I have been looking to train Mega too. @NielsRogge How is the integration of MEGA into Hugging Face coming along?

Tylersuard commented 1 year ago

@mnaylor5 I am getting this error on your Colab:

5 frames
/content/./mega/fairseq/modules/moving_average_gated_attention.py in forward(self, x, padding_mask, incremental_state, need_weights, attn_mask, before_attn_fn)
    303         # B x L x S -> B x K x C x S
    304         nc = seq_len // self.chunk_size
--> 305         q = q.reshape(bsz, nc, self.chunk_size, self.zdim)
    306
    307         if ctx_len < self.chunk_size:

RuntimeError: shape '[32, 621, 2, 64]' is invalid for input of size 2545664

Do I need to add some padding and the padding mask?

NielsRogge commented 1 year ago

Hi,

MEGA is now available here: https://huggingface.co/docs/transformers/main/model_doc/mega
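A minimal usage sketch, assuming the class names shown in the linked docs (the tiny config below is arbitrary rather than a released checkpoint; use from_pretrained with a checkpoint ID from the Hub for real weights):

```python
import torch
from transformers import MegaConfig, MegaModel

# Tiny randomly-initialized Mega for a smoke test.
config = MegaConfig(vocab_size=50265, hidden_size=128, num_hidden_layers=4)
model = MegaModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 256))
outputs = model(input_ids=input_ids)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 256, 128])
```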

mnaylor5 commented 1 year ago

@Tylersuard Yep, you can use MEGA in the main branch of Transformers - that PR was merged just a couple of weeks ago.

I haven't dug into your specific error, but I'd guess that you're using chunking and need to pad your inputs to a multiple of your chunk size.
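Something like the sketch below should do it (the tokenizer and chunk size are placeholders for whatever your setup uses): padding each batch up to a multiple of the chunk size lets the chunked reshape go through.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
chunk_size = 128  # placeholder: use the encoder_chunk_size your model was built with

batch = tokenizer(
    ["first long document ...", "second long document ..."],
    padding="longest",
    pad_to_multiple_of=chunk_size,
    return_tensors="pt",
)
# seq_len is now divisible by chunk_size, so the (batch, n_chunks, chunk_size, dim)
# reshape inside chunked attention succeeds; attention_mask marks the padded positions.
```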