huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CharacterBERT #9061

Open helboukkouri opened 3 years ago

helboukkouri commented 3 years ago

🌟 New model addition

Model description

CharacterBERT is a variant of BERT that uses a CharacterCNN module instead of WordPieces. As a result, the model:

  1. Does not require/rely on a WordPiece vocabulary
  2. Produces a single embedding for any (reasonable) input token
  3. Is more robust to misspellings

Paper: https://www.aclweb.org/anthology/2020.coling-main.609/
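For readers unfamiliar with the architecture, here is a rough, self-contained sketch of the idea (not the actual CharacterBERT code): each token is mapped to a fixed-length sequence of character IDs, and a small CNN with max-pooling turns that sequence into a single token embedding, so no WordPiece vocabulary is needed. The sizes and names below are illustrative; the real model uses several filter widths plus highway layers.

```python
import torch
import torch.nn as nn

class CharacterCNNTokenEncoder(nn.Module):
    """Toy character-level token encoder in the spirit of CharacterBERT/ELMo.
    Hypothetical simplification: the real CharacterCNN uses multiple filter
    widths and highway layers; this keeps only the core idea."""

    def __init__(self, num_chars=263, char_dim=16, num_filters=128,
                 kernel_size=3, output_dim=768, max_word_length=50):
        super().__init__()
        self.char_embeddings = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size)
        self.projection = nn.Linear(num_filters, output_dim)
        self.max_word_length = max_word_length

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_length) integer character IDs
        batch, seq_len, word_len = char_ids.shape
        x = self.char_embeddings(char_ids)          # (B, S, W, char_dim)
        x = x.view(batch * seq_len, word_len, -1)   # merge batch and sequence dims
        x = x.transpose(1, 2)                       # (B*S, char_dim, W) for Conv1d
        x = torch.relu(self.conv(x))                # (B*S, num_filters, W')
        x, _ = x.max(dim=-1)                        # max-pool over character positions
        x = self.projection(x)                      # (B*S, output_dim)
        return x.view(batch, seq_len, -1)           # one embedding per input token

# e.g. a batch of 2 sequences with 7 tokens of up to 50 characters each
encoder = CharacterCNNTokenEncoder()
char_ids = torch.randint(0, 263, (2, 7, 50))
print(encoder(char_ids).shape)  # torch.Size([2, 7, 768])
```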

Open source status

I am willing to work on a PR but I will probably need some guidance 😊

stefan-it commented 3 years ago

After reading the paper again, I'm really excited to pre-train models for another domain 🤗 Do you know when the pre-training code will be available? 🤔

helboukkouri commented 3 years ago

@stefan-it glad to hear that you enjoyed our work. I haven't released the pre-training code yet as it is not as user-friendly as I would want it to be. But it just so happens that I'm planning to work on releasing a first version sometime this week, so good timing 😊.

You can subscribe to the following issue if you want to be notified: https://github.com/helboukkouri/character-bert/issues/4

Cheers!

LysandreJik commented 3 years ago

Sounds great @helboukkouri! Let us know if we can help in any way, we'd love to see CharacterBERT in transformers!

stefan-it commented 3 years ago

Hey @helboukkouri, really cool PR for the upcoming model integration 🤗

I've already looked at it and have a question about the CharacterMapper implementation. The current implementation supports a maximum word length of 50 (so all word representations are padded to this length, if I'm reading it correctly). Do you think using a smaller value would decrease training (and later fine-tuning) time? 🤔

In German, for example, we can have really long words such as "Bezirksschornsteinfegermeister", but 50 still seems really long (though I think this is language-dependent).
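For context, a minimal toy sketch of what such a mapper does (not the actual CharacterMapper code; the real one also reserves special IDs for begin/end-of-word markers):

```python
def map_word_to_char_ids(word, max_word_length=50, pad_id=0):
    """Toy stand-in for a character mapper: UTF-8 bytes (shifted by 1) as
    character IDs, padded/truncated to a fixed length. Shrinking
    max_word_length shrinks the (batch, seq_len, max_word_length) input
    tensor accordingly."""
    char_ids = [b + 1 for b in word.encode("utf-8")[:max_word_length]]
    return char_ids + [pad_id] * (max_word_length - len(char_ids))

# "Bezirksschornsteinfegermeister" has 30 characters, so 20 of the 50 slots are padding
print(len(map_word_to_char_ids("Bezirksschornsteinfegermeister")))  # 50
```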

helboukkouri commented 3 years ago

Hey @stefan-it, thanks! 😊

> Do you think using a smaller value would decrease training (and later fine-tuning) time? 🤔

When we computed some stats on model speed, we found that while CharacterBERT is about twice as slow as BERT during pre-training (108% slower), it is not as slow during downstream task fine-tuning (19% slower on average). This means that most of the "slowness" happens during pre-training, which makes us think that the Masked Language Modeling output layer is at fault here. In particular, the differences with BERT are: (1) no parameter sharing between the WordPiece embedding matrix and the output layer, and (2) a larger output layer (we use the top 100k tokens of the pre-training corpus as a vocabulary), since we want to be able to predict a reasonably large number of tokens so that MLM remains beneficial.
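As a rough back-of-the-envelope (illustrative numbers only, assuming a hidden size of 768 and BERT's ~30k WordPiece vocabulary), the separate 100k output matrix alone accounts for tens of millions of extra parameters:

```python
hidden_size = 768

# BERT-style MLM head: output weights are tied to the ~30k WordPiece embeddings
bert_vocab = 30_522
bert_extra_params = 0  # tied weights; only a small transform/bias is added

# CharacterBERT-style MLM head: separate 100k-token output matrix (no tying possible,
# since input embeddings come from the CharacterCNN, not a lookup table)
charbert_vocab = 100_000
charbert_extra_params = charbert_vocab * hidden_size

print(f"extra MLM output parameters: ~{charbert_extra_params / 1e6:.0f}M")  # ~77M
# On top of that, every MLM prediction is a softmax over 100k classes instead of ~30k.
```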

So to answer your question: reducing the maximum word length might improve overall speed, but the effect will probably be negligible compared to the factors listed above.

You may wonder why we used 50-character-long representations. To be honest, we didn't want to tweak the CharacterCNN too much, as it is originally the same layer that is used in ELMo. We just trusted the folks at AllenAI to have done a good job choosing the architecture and re-used it 😄

stefan-it commented 3 years ago

Hi @helboukkouri, thanks for your detailed answer! This explains the whole training time/speed topic really well 🤗

helboukkouri commented 3 years ago

> After reading the paper again, I'm really excited to pre-train models for another domain 🤗 Do you know when the pre-training code will be available? 🤔

Code is out! Feel free to open issues if you have any problems using it.

pradeepsinghnitk commented 1 year ago

Hi @helboukkouri, I have read the paper with great interest. I am currently working on the same topic and tried to reproduce the results with our custom data. We were able to complete phase 1. Now we are heading towards fine-tuning the pretrained model for the MLM and NSP tasks. Would you consider sharing research materials for this?

helboukkouri commented 1 year ago

Hi @pradeepsinghnitk, thanks for your interest.

Could you be more specific about what you mean by "phase 1", and whether by "fine-tuning of the pretrained model for MLM and NSP tasks" you mean pre-training or actual task-specific fine-tuning (e.g. on text classification tasks)?

In any case, check this code, as it gives basic context for loading a model and running inference. Fine-tuning it on any task should be straightforward (basically as you would with BERT): https://github.com/helboukkouri/character-bert

And for NSP and MLM (which is usually what is called pre-training), the code is here: https://github.com/helboukkouri/character-bert-pretraining
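As an illustration of the "fine-tune it like BERT" point, here is a generic sketch of task-specific fine-tuning with a linear classification head on the encoder's first ([CLS]) position. The `encoder` argument stands in for whatever model class the character-bert repo exposes, so treat the names and the assumed output format as placeholders:

```python
import torch
import torch.nn as nn

class CharacterBertClassifier(nn.Module):
    """Generic sequence classifier on top of a BERT-like encoder.
    Assumption: `encoder(char_input_ids)` returns a tuple whose first element
    is the sequence output of shape (batch, seq_len, hidden_size)."""

    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, char_input_ids, labels=None):
        sequence_output = self.encoder(char_input_ids)[0]  # (B, S, H)
        cls_vector = self.dropout(sequence_output[:, 0])   # [CLS] position
        logits = self.classifier(cls_vector)
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits, labels)
            return loss, logits
        return logits
```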

Unfortunately, the integration of CharacterBERT into the transformers library did not really succeed. It's been a while, but if I remember correctly the issues were related to various tests failing because character-based tokenization was not very well supported at the time.

I'll notify everybody if I ever go back to working on this again.

Cheers!

pradeepsinghnitk commented 1 year ago

Thank you for your response. To be specific about phase 1: we successfully executed `bash $WORKDIR/bash_scripts/run_pretraining.character_bert.step_1.sh` (phase 1: maximum input length of 128 and a maximum of 20 masked tokens per input) for both CharacterBERT pretraining and BERT-based pretraining. Now we would like to reproduce https://github.com/helboukkouri/character-bert-finetuning, but no code has been uploaded there.

"And for NSP and MLM (which is usually what is called pre-training), the code is here: https://github.com/helboukkouri/character-bert-pretraining". this part of the scripts we have already executed

pablorp80 commented 1 year ago

Looking forward to this integration since December 2020!

RealNicolasBourbaki commented 1 year ago

@stefan-it Hi Stefan, I saw on your Twitter account that you finished training a German version of CharacterBERT. It is not on Hugging Face yet, but I am writing my master's thesis on OCR post-correction for a historical German corpus and could really use it! Can you tell me how I can get access to your model? Thank you so much! Greetings from Stuttgart!

heyarny commented 8 months ago

Is it still not supported by transformers?

helboukkouri commented 8 months ago

Unfortunately no. I opened a PR a couple of months ago (https://github.com/huggingface/transformers/pull/26617), but I haven't had the chance to work on it since, so it got closed again :/