kwonmha / bert-vocab-builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

BERT trained on custom corpus #15


anidiatm41 commented 3 years ago

Hi M. H. Kwon, Your tokenization script is really helpful.

I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.). As a result I got a vocab file, a .tfrecord file, a .json file, and checkpoint files.

Now, how do I use those files for the tasks below:

  1. predicting a missing word in a given sentence
  2. next sentence prediction
  3. a Q and A model

Need your help.

kwonmha commented 3 years ago

Hi, anidiatm41, Thank you.

For 3, the Q and A model: visit the official BERT GitHub repository. There are instructions on how to run tasks like QA (SQuAD).

Predicting missing words and next sentence prediction are usually used only during training. If you want to predict missing words for a practical purpose, you need to write your own code. You can refer to the evaluation part of run_pretraining.py; it is almost the same.
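To illustrate the last point: the evaluation code in run_pretraining.py computes masked-LM logits over the vocabulary for each masked position, and turning those logits into predicted words is essentially an argmax over the vocabulary at each `[MASK]` position. The sketch below is not code from this repository or from BERT itself; the tiny vocabulary and the logits array are made-up stand-ins for what the model would actually produce.

```python
import numpy as np

# Toy stand-in vocabulary; a real BERT vocab file has ~30k wordpieces.
vocab = ["[PAD]", "[MASK]", "the", "cat", "sat", "mat"]

def predict_masked(logits, mask_positions, vocab):
    """Return the highest-scoring vocab token for each masked position.

    logits: array of shape (seq_len, vocab_size), the masked-LM scores.
    mask_positions: indices of the [MASK] tokens in the input sequence.
    """
    best_ids = np.argmax(logits, axis=-1)          # best token id per position
    return [vocab[best_ids[p]] for p in mask_positions]

# Fake logits for a 3-token sequence; position 1 is masked, and the
# scores there favor token id 3 ("cat").
logits = np.zeros((3, len(vocab)))
logits[1, 3] = 5.0
print(predict_masked(logits, [1], vocab))  # ['cat']
```

In real use, the logits would come from running the trained checkpoint's masked-LM head over a tokenized sentence containing `[MASK]`; the argmax step above stays the same.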