google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Create and Load my own Pre-Training data from Scratch #1121

dhimasyoga16 opened this issue 4 years ago (Open)

dhimasyoga16 commented 4 years ago

I'm a student and I'm currently doing research on a Q&A system using BERT. I'm also new to NLP. From some of the sources I've read (papers, GitHub discussions, etc.), I know that BERT has its own pre-training data, obtained from large corpora such as Wikipedia.

What I want to know is: how can I create my own pre-training data? In my Q&A research, the data I'm using come from Quora and are in Indonesian. If I need to run any file, which file should I run?

Thank you so much in advance.

dhanushanthp commented 4 years ago

@dhimasyoga16 Follow the steps at https://github.com/google-research/bert#pre-training-with-bert. The basic steps (example commands below) are:

  1. Create TFRecord data from your text file, which should contain one sentence per line.
  2. Train your own model, initializing from the weights of a Google BERT pre-trained checkpoint with --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt.
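
For reference, here is a minimal sketch of those two steps using the scripts in this repository, mirroring the README's pre-training example. The input/output paths and $BERT_BASE_DIR are placeholders; since your corpus is Indonesian, $BERT_BASE_DIR would most likely point at the Multilingual Cased checkpoint. The hyperparameter values are the README's demo values and would need to be adapted for real training.

```bash
# Step 1: turn a plain-text corpus (one sentence per line, blank lines
# between documents) into masked-LM / next-sentence-prediction TFRecords.
python create_pretraining_data.py \
  --input_file=./my_indonesian_corpus.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```

```bash
# Step 2: continue pre-training from the released checkpoint on the
# TFRecords produced above (num_train_steps here is the README's toy
# value; real runs use far more steps).
python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
```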
dhimasyoga16 commented 4 years ago

Hi, thank you for the answer.

Can I ask some more questions? What steps do I need to follow to develop a question answering system with BERT?

From what I've concluded after reading a number of journal articles, the process might look like this:

  1. Prepare data
  2. Generate Pre-Train Data
  3. Cluster the Pre-Trained Data
  4. Train the model/data (see the fine-tuning sketch after this comment)
  5. Testing
  6. Evaluation

Are those steps correct? Also, what is the pre-training process used for? Once again, thank you so much in advance, and I'm sorry if I'm asking too many questions.
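
For completeness, here is a sketch of what steps 4 and 5 (training and testing the Q&A model) could look like using this repository's SQuAD fine-tuning script, run_squad.py, following the README's SQuAD 1.1 example. It assumes the Quora data has first been converted into SQuAD-style JSON files; the file names under $SQUAD_DIR and the hyperparameters are placeholders, and --init_checkpoint could instead point at the checkpoint produced by run_pretraining.py above.

```bash
# Fine-tune and evaluate BERT for extractive question answering on a
# SQuAD-format dataset (here assumed to hold the converted Quora data).
python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_output/
```

The dev-set predictions are written to predictions.json in the output directory, which can then be scored with a SQuAD-style evaluation script for the evaluation step.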