google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

BERT pre-training using only domain specific text #615

Open nightowlcity opened 5 years ago

nightowlcity commented 5 years ago

BERT is pre-trained on Wikipedia and other sources of ordinary text, but my problem domain has a very specific vocabulary and grammar. Is there an easy way to train BERT entirely from domain-specific data (preferably using Keras)?

The amount of pre-training data is not an issue, and we are not looking for SOTA results. We would do fine with a smaller-scale model, but it has to be trained on our data.

hsm207 commented 5 years ago

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.
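
For reference, here is a minimal sketch of that two-step pipeline based on the repo README. All file paths (domain_corpus.txt, vocab.txt, bert_config.json, pretrain_output) are placeholders for your own files, and the hyperparameter values are only illustrative.

```python
# Sketch of pre-training from scratch on a domain corpus.
# Paths and hyperparameters below are placeholders / illustrative values.
import subprocess

# Step 1: convert raw text (one sentence per line, blank lines between
# documents) into masked-LM / next-sentence-prediction TFRecords.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=domain_corpus.txt",
    "--output_file=pretrain_examples.tfrecord",
    "--vocab_file=vocab.txt",
    "--do_lower_case=True",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: pre-train. Omitting --init_checkpoint means the model starts from
# randomly initialized weights, i.e. training completely from scratch.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=pretrain_examples.tfrecord",
    "--output_dir=pretrain_output",
    "--do_train=True",
    "--do_eval=True",
    "--bert_config_file=bert_config.json",
    "--train_batch_size=32",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--num_train_steps=100000",
    "--num_warmup_steps=10000",
    "--learning_rate=2e-5",
], check=True)
```

A smaller-scale model is just a bert_config.json with fewer layers and a smaller hidden size than the released configs.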

PetreanuAndi commented 5 years ago

What if we want to leverage the already pre-trained model (and its language knowledge) and fine-tune it on a specific closed-domain dataset? It seems we would need exactly the same word embeddings for it to leverage the existing knowledge during fine-tuning. How do we know whether our word embeddings match those used by Google in their vocab? What if our vocab has new words that were not present in the original training vocab?

Thank you, much appreciated!

hsm207 commented 5 years ago

The word embeddings are stored in the checkpoint files too. Also, the 'words' are actually WordPiece tokens, and that tokenization is handled by the create_pretraining_data.py script. So you don't have to worry about whether your word embeddings match those used by Google in their vocab.

If you still want to add new words, there are a few issues in this repo discussing that. You can start by reading https://github.com/google-research/bert/issues/396.
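
To make the WordPiece point concrete, here is a tiny sketch (assuming tokenization.py from this repo is importable and vocab.txt is a released vocab file; the example word and its split are illustrative):

```python
# An out-of-vocab word is split into known wordpieces instead of needing its
# own vocab entry; only truly unmatchable pieces become [UNK].
import tokenization  # tokenization.py from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
print(tokenizer.tokenize("myocardial infarction"))
# -> a list of wordpieces, e.g. something like ['my', '##oca', '##rdial', 'in', '##far', '##ction']
```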

PetreanuAndi commented 5 years ago

@hsm207, I'm not worried about being able to train a model with the correct input "format". I'm worried that the "fine-tuning" process is not actually language fine-tuning but rather learning new knowledge from scratch (given that the multilingual vocabs are quite small, at least for my native language).

Is it not safe to assume that if I am "fine-tuning" on a corpus with a lot of new (out-of-original-vocab) words, then I'm not actually fine-tuning the model but in fact re-training it from scratch? (given that it has not seen those words before, or has labeled them as unknown)

From the tokenization script: `if is_bad: output_tokens.append(self.unk_token)` (out-of-vocab words become the unknown token)

hsm207 commented 5 years ago

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot of them and those words are important to your domain, then it is a problem and you will need to add those words to the vocab.
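
A rough way to measure that, as a sketch (file names are placeholders; assumes tokenization.py from this repo is importable):

```python
# Count how often the domain corpus falls back to [UNK] after WordPiece
# tokenization; a high rate on important terms suggests the vocab needs work.
import tokenization  # tokenization.py from this repo

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

total, unknown = 0, 0
with open("domain_corpus.txt", encoding="utf-8") as f:
    for line in f:
        pieces = tokenizer.tokenize(line)
        total += len(pieces)
        unknown += sum(1 for p in pieces if p == "[UNK]")

print("%d of %d wordpieces (%.2f%%) are [UNK]"
      % (unknown, total, 100.0 * unknown / max(total, 1)))
```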

PetreanuAndi commented 5 years ago

@hsm207 , "add these words to the vocab" -> does that not imply re-training from scratch on a specific-language corpus? Otherwise, the unknown tokens will just be assigned random values, i guess? (random values => no correlation or relationship between words, bad :( right? )

hsm207 commented 5 years ago

@PetreanuAndi yes, you are right. Adding words to the vocab and then fine-tuning on your corpus essentially means training those words' embeddings from scratch on your domain-specific corpus.
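
For reference, the trick discussed in issues like #396 is to keep the embedding matrix the same size by overwriting the reserved [unusedX] slots in the released vocab with domain terms; their embeddings are effectively untrained until you continue pre-training or fine-tuning on your corpus. A sketch, with placeholder paths and terms:

```python
# Vocab surgery sketch: replace reserved [unusedX] entries with domain terms so
# the vocab size (and the checkpoint's embedding matrix shape) stays unchanged.
# Paths and the term list are placeholders.
new_terms = ["myocardial", "troponin", "angioplasty"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

terms = iter(new_terms)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(terms)
        except StopIteration:
            break

with open("domain_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```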

PetreanuAndi commented 5 years ago

@hsm207 thank you. I have already begun doing just that, but needed some peer-review confirmation of my approach :) Such a shame though. The famous ImageNet moment for NLP, such praise, much awe, and it is actually proficient only for English and Chinese :)

Scagin commented 5 years ago

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

I tried training my own pre-trained model using run_pretraining.py, but I found it only runs on the CPU, not the GPU. Is this a problem with TPUEstimator? How can I run the code on a GPU?

hsm207 commented 5 years ago

How did you figure out it was running on your CPU and not GPU? Was it based on the logs or nvidia-smi?

Anyway, you can try the implementation from Hugging Face. It looks like they have figured out how to run the pretraining using GPUs.
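
Before switching implementations, it may be worth checking whether the installed TensorFlow build can see a GPU at all; a quick sketch (TF 1.x era):

```python
# If TensorFlow was installed without GPU support (or cannot find CUDA),
# training quietly runs on the CPU. This prints what TF actually sees.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])
```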

008karan commented 5 years ago

@Scagin I am facing the same issue. How did you solve it?

gsasikiran commented 5 years ago

You can pre-train BERT from scratch on your own data using the run_pretraining.py script.

After pretraining with run_pretraining.py, the model has produced checkpoints, but I need the word embeddings. Can I derive word embeddings from the checkpoints?

hsm207 commented 5 years ago

@gsasikiran yes, you can. See here for details and adapt as necessary.
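
If all that is needed is the static token-embedding table (rather than contextual features), one way is to read it straight out of the checkpoint. This is a sketch with placeholder paths, not necessarily what the linked reference does:

```python
# Read the raw token-embedding matrix from a BERT TF checkpoint and pair it
# with the vocab. "pretrain_output" is the output_dir used during pretraining.
import tensorflow as tf
import tokenization  # tokenization.py from this repo

reader = tf.train.load_checkpoint("pretrain_output")
embeddings = reader.get_tensor("bert/embeddings/word_embeddings")  # [vocab_size, hidden_size]

vocab = list(tokenization.load_vocab("vocab.txt").keys())
print(embeddings.shape)
print(vocab[100], embeddings[100][:5])  # embedding row for the 101st vocab entry
```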

gsasikiran commented 5 years ago

@hsm207 Thank you. It worked and produced a JSON file, but that file contains 16 dictionaries with tokens for only [CLS] and [SEP]. I wonder what happened to the remaining words in the vocabulary.

hsm207 commented 5 years ago

@gsasikiran can you share a minimal and fully reproducible example? I'd like to run your code myself.

gsasikiran commented 5 years ago

https://colab.research.google.com/drive/1ZXn2cVpyvfUscN_-FD1Z_h3HW9xMi_nd

Here I provide the link to the colab with my program. Let me know, if I had to provide my training data and vocab.txt too.

hsm207 commented 5 years ago

@gsasikiran you need to provide everything so that I can reproduce your results.

gsasikiran commented 5 years ago

data_files.zip

bert_config_file: bert_config.json
input_file: training_data.txt
vocab_file: deep_vocab.txt

EDIT: I have removed the training data, which may be subject to copyright.

gsasikiran commented 4 years ago

@hsm207 Have you got the embeddings?

hsm207 commented 4 years ago

@gsasikiran I have trouble running your notebook.

Specifically I am getting this error:

(error screenshot omitted)

Can you insert into the notebook all the code needed to download the data for your use case too?

gsasikiran commented 4 years ago

https://colab.research.google.com/drive/1ZXn2cVpyvfUscN_-FD1Z_h3HW9xMi_nd

I hope this helps

hsm207 commented 4 years ago

@gsasikiran I don't see a problem with the results. I can view the embeddings for all the tokens in my input.

See the last cell in this notebook: https://gist.github.com/hsm207/143b6349ed1c92960be0dc1c6165d551

gsasikiran commented 4 years ago

@hsm207 Thank you for your time. I have found the problem, which is my input text. The input file I provided has many empty lines at the start, and those lines produced only [CLS] and [SEP] token embeddings.
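
For reference, dropping blank lines before the feature-extraction step avoids output rows that contain only [CLS]/[SEP]; a trivial sketch with placeholder file names:

```python
# Remove empty lines so every extracted example has real tokens.
with open("input.txt", encoding="utf-8") as src, \
        open("input_nonblank.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(line)
```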

vr25 commented 4 years ago

Hi,

I am trying to use a domain-specific BERT (FinBERT) for my task, but it looks like files such as config.json, pytorch_model.bin, and vocab.txt are missing from the repository. On the other hand, there are two vocabulary files that were created as part of pretraining the domain-specific FinBERT.

I was wondering if I can use the above BERT_pretraining_share.ipynb to create the config.json and pytorch_model.bin, and then use the result as a domain-specific bert-base-uncased model.

Thanks.

gsasikiran commented 4 years ago

@vr25 I have no problem.

imayachita commented 4 years ago

Hi @PetreanuAndi and @nightowlcity, did you manage to pre-train/fine-tune your model on your domain-specific text? Does pre-training from scratch and adding words to the vocab make a significant difference compared to just fine-tuning? Thanks!

viva2202 commented 4 years ago

@PetreanuAndi, @nightowlcity, @imayachita I would also be very interested in your experiences. I am also facing the decision of whether to fine-tune an existing model or to train a new one from scratch.

Thank you in advance for sharing your experiences!

nagads commented 3 years ago

The small vocab isn't really a problem. You will need to see how many unknown tokens you end up with after the WordPiece tokenization step. If there are a lot of them and those words are important to your domain, then it is a problem and you will need to add those words to the vocab.

@hsm207 Do you have any suggestions on how much domain corpus data is needed to learn new vocab embeddings, assuming I would leverage the already-learned weights for the existing words in the current vocab? Thanks.

hsm207 commented 3 years ago

@nagads I would look to the ULMFiT model for guidance.

They tested the general pretrain -> domain specific pretrain -> task-specific finetuning approach on several datasets:

(screenshot of ULMFiT results tables omitted)

nagads commented 3 years ago

@hsm207 Thanks a ton. This is helpful.

nagads commented 3 years ago

@pkrishnavamshi could you clarify the rationale for re-training rather than fine-tuning again on a smaller dataset? Thanks. Re-training in BERT has various connotations:

  1. you want to learn new vocabulary, as in the case of BioBERT
  2. you want to fine-tune the language model with the same vocab to better fit the tone and tenor of the domain (as in the case of ULMFiT)