Closed. hairzooc closed this issue 3 years ago.
Thank you for your swift reply :) How about the ELECTRA model? Is it possible to pretrain it from scratch as well?
Did you read the article? Section 3
Yup, I've read Section 3. :) As far as I know, ELECTRA uses replaced token detection with a discriminator and a generator (GAN-style). That's why I thought there could be something different from BERT-style masked LM. I found the open issue below as well.
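For anyone landing here, that generator/discriminator split maps directly onto the transformers classes. A minimal sketch, assuming a recent transformers version with `ElectraConfig`, `ElectraForMaskedLM`, and `ElectraForPreTraining` available (the tiny config sizes below are illustrative, not ELECTRA-small):

```python
import torch
from transformers import ElectraConfig, ElectraForMaskedLM, ElectraForPreTraining

# Toy config for illustration only -- real ELECTRA checkpoints are much larger.
config = ElectraConfig(
    vocab_size=1000,
    embedding_size=32,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# ELECTRA pretraining pairs a small generator (a masked LM that proposes
# replacement tokens) with a discriminator trained on replaced token detection.
generator = ElectraForMaskedLM(config)
discriminator = ElectraForPreTraining(config)

input_ids = torch.randint(0, config.vocab_size, (1, 16))
out = discriminator(input_ids)
# The discriminator emits one real-vs-replaced logit per input token.
print(out.logits.shape)
```

Wiring the two together (sampling replacements from the generator and feeding them to the discriminator) is the part the example script does not do for you, which is what the rest of this thread is about.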
I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago for training ELECTRA from scratch. But there were some problems (maybe bugs) I had to solve for this task.
Currently I'm setting up a clean running version for training an ELECTRA language model from scratch with an additional document classification head, based on that script.
I got it. Thank you for your effort!
@miketrimmel Hi, is there still a bug if I try to train ELECTRA from scratch using run_language_modeling.py, or does it work now? Thanks!
I had issues with the tb_writer. I tried it again just now and there were no issues with the writer any more (maybe I had an old version). If you're using a pretrained tokenizer it should work now. Training a new tokenizer is not supported. I have to say I'm new to tokenization. I'm training a Twitter language model from scratch, so I wasn't sure whether the model would perform as well with the pretrained tokenizer (a lot of vocabulary could be missing because of the Twitter slang). So I trained a custom tokenizer. I will compare the different tokenizers over the next few days. I will also provide the model and tokenizer when it's finished, in case someone wants to fine-tune it on their own Twitter task.
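For reference, training a custom WordPiece tokenizer with the tokenizers library looks roughly like this. A minimal sketch assuming the `BertWordPieceTokenizer` implementation, with a toy in-memory corpus standing in for a real Twitter dump:

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Toy stand-in corpus; replace with your actual Twitter data files.
Path("tweets.txt").write_text("just setting up my twttr\nhello world\nhello there\n")

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["tweets.txt"],
    vocab_size=200,          # toy value; something like 30k is more typical
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

enc = tokenizer.encode("hello twitter world")
print(enc.tokens)
```

Whether a from-scratch vocabulary beats the pretrained one on Twitter text is exactly the comparison described above; the trade-off is slang coverage versus losing the pretrained embeddings' alignment.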
Great! Thanks for explanation :)
@miketrimmel Could you please share the code for pretraining electra from scratch?
Yes, I will share it here in the next few days. I'm actually busy with other things at the moment, and I have to clean it up first :D
Could I ask what is meant by "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it, and load it from here, using --tokenizer_name"? @miketrimmel
Could I use a tokenizer from https://github.com/huggingface/tokenizers for initialization? I'd like to train a model from scratch.
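What that message means in practice: train and save the tokenizer in a separate script, then pass the saved directory to the training script via `--tokenizer_name`. A minimal sketch, assuming the tokenizers and transformers packages (the file names and the `my-tokenizer` directory are placeholders):

```python
import os
from pathlib import Path
from tokenizers import BertWordPieceTokenizer
from transformers import ElectraTokenizer

# Step 1 (the "other script"): train a tokenizer and save its vocab.
Path("corpus.txt").write_text("hello world\ntraining a tokenizer from scratch\n")
tok = BertWordPieceTokenizer(lowercase=True)
tok.train(files=["corpus.txt"], vocab_size=200, min_frequency=1,
          special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
os.makedirs("my-tokenizer", exist_ok=True)
tok.save_model("my-tokenizer")  # writes my-tokenizer/vocab.txt

# Step 2: pass --tokenizer_name ./my-tokenizer to the training script,
# which then loads it roughly like this:
loaded = ElectraTokenizer.from_pretrained("./my-tokenizer")
print(loaded.tokenize("hello world"))
```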
Yes, you could use a tokenizer from https://github.com/huggingface/tokenizers. But there is no batch_encode_plus method, so I used the solution from another issue, https://github.com/huggingface/tokenizers/issues/259. The wrapper from @theblackcat102 worked for me.
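A minimal sketch of what such a wrapper can look like, assuming current tokenizers where the native method is `encode_batch`; the class and method names here are hypothetical stand-ins, not the exact wrapper from the linked issue:

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Build a tiny tokenizer inline so the sketch is self-contained.
Path("corpus.txt").write_text("hello world\nhello there\n")
raw_tok = BertWordPieceTokenizer(lowercase=True)
raw_tok.train(files=["corpus.txt"], vocab_size=100, min_frequency=1,
              special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

class TokenizerWrapper:
    """Hypothetical shim exposing a batch_encode_plus-style call on top of
    the library's native encode_batch (not the exact wrapper from the issue)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def batch_encode_plus(self, texts):
        encodings = self.tokenizer.encode_batch(texts)
        return {
            "input_ids": [e.ids for e in encodings],
            "attention_mask": [e.attention_mask for e in encodings],
        }

wrapped = TokenizerWrapper(raw_tok)
batch = wrapped.batch_encode_plus(["hello world", "hello"])
print(len(batch["input_ids"]))
```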
There is code for training ELECTRA from scratch still undergoing testing here: https://github.com/huggingface/transformers/pull/4656
It's still under development, but it's pretty stable now.
> I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago for training ELECTRA from scratch. But there were some problems (maybe bugs) I had to solve for this task.
> Currently I'm setting up a clean running version for training an ELECTRA language model from scratch with an additional document classification head, based on that script.
Any chance you could share the code? I've been trying to do this myself, but I'm failing to get results (whether in fine-tuning, or in running ELECTRA with TF in HF). Thanks!
Can you give me some advice on how to pretrain the BART model on my own dataset? Thank you so much!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Detailed Explanation https://mlcom.github.io/
> I modified the https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py script a few days ago for training ELECTRA from scratch. But there were some problems (maybe bugs) I had to solve for this task.
> Currently I'm setting up a clean running version for training an ELECTRA language model from scratch with an additional document classification head, based on that script.
The location is currently not available... please share the exact location.
Detailed Explanation https://mlcom.github.io/Create-Language-Model/
Hi, I've been fine-tuning lots of tasks using this repo. Thanks :) But I couldn't find any pretraining-from-scratch examples. Please let me know if you have any advice on that; it would be very helpful for my research.
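For what it's worth, "from scratch" in the example scripts boils down to building the model from a config instead of calling from_pretrained. A minimal sketch with a toy BERT-style masked LM, assuming transformers and torch (the config sizes are illustrative):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# "From scratch" = random weights built from a config; no from_pretrained call.
# Toy sizes for illustration.
config = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)
model = BertForMaskedLM(config)

input_ids = torch.randint(0, config.vocab_size, (2, 16))
labels = input_ids.clone()  # a data collator would normally mask ~15% of these
out = model(input_ids, labels=labels)
print(out.loss.item())
```

The same pattern is what `--model_type` (without `--model_name_or_path`) triggers in run_language_modeling.py; the loss above would then be minimized by the Trainer loop.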