huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BERT and other models pretraining from scratch example #4425

Closed hairzooc closed 3 years ago

hairzooc commented 4 years ago

Hi, I've been fine-tuning on lots of tasks using this repo. Thanks :) But I couldn't find any pretraining-from-scratch examples. Please let me know if you have any advice on that. It would be very helpful for my research.

miketrimmel commented 4 years ago

https://huggingface.co/blog/how-to-train
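
Roughly, the recipe in that post boils down to something like the sketch below (paths, hyperparameters, and the tokenizer directory are placeholders; it assumes a tokenizer has already been trained and saved as described in the post, and exact argument names can vary between transformers versions):

```python
# Minimal from-scratch masked-LM training sketch, loosely following the blog post.
# "./my_tokenizer" and "./corpus.txt" are placeholders; hyperparameters are illustrative.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenizer trained separately (e.g. with the huggingface/tokenizers library) and saved to disk.
tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer", model_max_length=512)

# A fresh, untrained model built from a config rather than from pretrained weights.
config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=514)
model = RobertaForMaskedLM(config=config)

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./model_from_scratch",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(model=model, args=training_args, data_collator=collator, train_dataset=dataset)
trainer.train()
trainer.save_model("./model_from_scratch")
```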

hairzooc commented 4 years ago

Thank you for your swift reply :) What about the Electra model? Is it possible to pretrain it from scratch as well?

miketrimmel commented 4 years ago

Did you read the article? Section 3

hairzooc commented 4 years ago

Yup, I've read Section 3. :) As far as I know, Electra uses replaced token detection with a discriminator and a generator (GAN-style). That's why I thought there could be something different from BERT-like masked LM training. I also found the open issue below.

https://github.com/huggingface/transformers/issues/3878
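
For reference, my understanding is that transformers exposes the two Electra pieces as separate classes, but the joint replaced-token-detection objective is not built in. A rough sketch (layer sizes are illustrative only):

```python
# The two ELECTRA components as separate transformers classes.
# The joint loss (generator MLM loss + weighted discriminator replaced-token-detection
# loss) is not built in; it has to be implemented in the training loop.
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

gen_config = ElectraConfig(embedding_size=128, hidden_size=64, num_hidden_layers=12)    # small generator
disc_config = ElectraConfig(embedding_size=128, hidden_size=256, num_hidden_layers=12)  # larger discriminator

generator = ElectraForMaskedLM(gen_config)          # fills in [MASK] positions
discriminator = ElectraForPreTraining(disc_config)  # per-token "original vs. replaced" prediction

dummy_input = torch.tensor([[101, 2023, 2003, 1037, 7953, 102]])  # dummy token IDs
disc_logits = discriminator(dummy_input)[0]         # shape: (batch_size, seq_len)
print(disc_logits.shape)
```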

miketrimmel commented 4 years ago

I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago to train Electra from scratch, but there were some problems (maybe bugs) I had to solve for this task.

Currently I'm setting up a clean, running version for training an Electra language model from scratch, with an additional document classification head, based on that script.
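
For the classification part, transformers has an ElectraForSequenceClassification class (in recent versions), so that last step could look roughly like this sketch, where "./electra_from_scratch" is a placeholder for the saved pretraining output:

```python
# Hypothetical follow-up step: load the pretrained discriminator weights into a
# sequence-classification model for the document classification task.
from transformers import ElectraForSequenceClassification

doc_classifier = ElectraForSequenceClassification.from_pretrained(
    "./electra_from_scratch",  # placeholder path to the saved pretrained model
    num_labels=4,              # illustrative number of document classes
)
# doc_classifier can then be fine-tuned with the usual Trainer setup.
```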

hairzooc commented 4 years ago

I got it. Thank you for your effort!

amy-hyunji commented 4 years ago

@miketrimmel Hi, is there still a bug if I try to train Electra from scratch using run_language_modeling.py, or does it work now? Thanks!

miketrimmel commented 4 years ago

I had issues with the tb_writer. I tried it again just now and there were no issues with the writer any more (maybe I had an old version). If you're using a pretrained tokenizer it should work now; training a new tokenizer is not supported. I have to say I'm new to the tokenization side of things. I'm training a Twitter language model from scratch, so I wasn't sure whether the model would perform as well with the pretrained tokenizer (a lot of vocabulary may be missing because of the "Twitter slang"), so I trained a custom tokenizer. I will compare the different tokenizers over the next few days. I will also provide the model and tokenizer when they are finished, in case someone wants to fine-tune them on their own Twitter task.
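
If it helps anyone doing a similar comparison, one simple way to sanity-check a pretrained tokenizer against a custom one on Twitter-style text is to look at how heavily each fragments the same tweets (a sketch; "./twitter_tokenizer" and the sample tweet are made up):

```python
# Rough sanity check: compare how a pretrained tokenizer and a custom one split
# Twitter-style text. "./twitter_tokenizer" is a placeholder for a tokenizer trained
# on a tweet corpus and saved to disk.
from transformers import BertTokenizerFast

pretrained = BertTokenizerFast.from_pretrained("bert-base-uncased")
custom = BertTokenizerFast.from_pretrained("./twitter_tokenizer")

tweet = "lol that new phone is lowkey fire ngl #blessed"

for name, tok in [("pretrained", pretrained), ("custom", custom)]:
    pieces = tok.tokenize(tweet)
    # Heavy fragmentation (many subword pieces per word, or lots of [UNK]) suggests
    # the vocabulary does not cover the domain well.
    print(name, len(pieces), pieces)
```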

amy-hyunji commented 4 years ago

Great! Thanks for the explanation :)

glakshmidhar commented 4 years ago

@miketrimmel Could you please share the code for pretraining electra from scratch?

miketrimmel commented 4 years ago

Yes, I will share it here in the next few days. I'm busy with other things at the moment, and I have to make it pretty first :D

dongdongyang-houzz commented 4 years ago

Could I ask what is meant by "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it, and load it from here, using --tokenizer_name"? @miketrimmel

Could I use a tokenizer from https://github.com/huggingface/tokenizers for initialization? I'd like to train a model from scratch.
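
For reference, my rough understanding of what the warning suggests is something like the sketch below (paths and vocabulary settings are placeholders; the save API differs slightly between tokenizers versions):

```python
# Train the tokenizer in a separate script with the huggingface/tokenizers library,
# save it to disk, and then point run_language_modeling.py at that directory.
# "./corpus.txt" and "./my_tokenizer" are placeholders.
import os
from tokenizers import BertWordPieceTokenizer

os.makedirs("./my_tokenizer", exist_ok=True)

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["./corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("./my_tokenizer")  # writes vocab.txt; older versions use .save(dir, name)

# The training script can then load it as a pretrained tokenizer, e.g.
#   python run_language_modeling.py ... --tokenizer_name ./my_tokenizer
```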

miketrimmel commented 4 years ago

Yes, you could use a tokenizer from https://github.com/huggingface/tokenizers, but it has no batch_encode_plus method. I used the solution from another issue, https://github.com/huggingface/tokenizers/issues/259; the wrapper from @theblackcat102 worked for me.
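
As a rough illustration of the wrapping idea (not the exact wrapper from that issue): recent transformers versions can wrap a raw tokenizers.Tokenizer in PreTrainedTokenizerFast, which then exposes the usual batched encoding methods. "tokenizer.json" is a placeholder for a tokenizer saved with the tokenizers library:

```python
# Wrap a raw tokenizers.Tokenizer so the standard transformers encoding methods
# (including batched encoding) become available.
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

raw_tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    model_max_length=512,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Batched encoding now works via __call__ (or batch_encode_plus).
batch = wrapped(["first sentence", "second sentence"], padding=True, truncation=True)
print(batch["input_ids"])
```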

LysandreJik commented 4 years ago

There is code for training ELECTRA from scratch, still undergoing testing, here: https://github.com/huggingface/transformers/pull/4656

It's still under development, but it's pretty stable now.

ddofer commented 4 years ago

> I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago to train Electra from scratch, but there were some problems (maybe bugs) I had to solve for this task.
>
> Currently I'm setting up a clean, running version for training an Electra language model from scratch, with an additional document classification head, based on that script.

Any chance you could share the code? I've been trying to do this myself, but I'm failing to get results (whether in fine-tuning, or in running Electra with TF in HF). Thanks!

zy329jy commented 4 years ago

Can you give me some advice on how to pretrain the BART model on my own dataset? Thank you so much!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mlcom commented 3 years ago

Detailed Explanation https://mlcom.github.io/

apkbala107 commented 3 years ago

> I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago to train Electra from scratch, but there were some problems (maybe bugs) I had to solve for this task.
>
> Currently I'm setting up a clean, running version for training an Electra language model from scratch, with an additional document classification head, based on that script.

That location is currently not available... please share the exact location.

apkbala107 commented 3 years ago

> Detailed Explanation https://mlcom.github.io/Create-Language-Model/

That location is currently not available... please share the exact location.

mlcom commented 3 years ago

> Detailed Explanation https://mlcom.github.io/Create-Language-Model/
>
> That location is currently not available... please share the exact location.

mlcom.github.io