google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How many articles (Wiki+Book corpus) does BERT use in pretraining? #570

Open Qinzhun opened 5 years ago

Qinzhun commented 5 years ago

The BERT paper ("BERT: Pre-training of Deep...") mentions that the Wikipedia and BooksCorpus datasets are used for pretraining. When I generate my own data from Wikipedia, I get about 5.5 million articles, and about 15 million examples at sequence length 512 using the script _create_pretraining_data.py_.

The paper also mentions 1,000,000 training steps for 40 epochs with batch size 256, which implies roughly 6.4 million examples per epoch for pretraining (Wiki + BooksCorpus), since 1,000,000 × 256 / 40 ≈ 6.4M. That is very different from my result. So I am confused: are there other preprocessing measures applied to the Wikipedia data, such as filtering out articles shorter than XX? And if I use these 15 million examples for pretraining, will that significantly affect my results?
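
To make the mismatch concrete, here is the back-of-the-envelope arithmetic I am using (a plain Python sketch; the 15 million figure is from my own run, not from the paper):

```python
# Numbers from the BERT paper plus the example count from my own run above.
train_steps = 1_000_000
batch_size = 256
epochs = 40

sequences_seen = train_steps * batch_size      # 256M sequences processed in total
examples_per_epoch = sequences_seen / epochs   # ~6.4M unique examples implied by the paper

my_examples = 15_000_000                       # what create_pretraining_data.py gave me at length 512
print(examples_per_epoch)                      # 6400000.0
print(my_examples / examples_per_epoch)        # ~2.34, i.e. over twice the implied corpus size
```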

I would be thankful to anyone who can help.

DecstionBack commented 5 years ago

Hi, I have the same problem. I processed the corpus with the PyTorch implementation from Hugging Face (https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning), which they say follows "create_pretraining_data.py" from the TensorFlow repo.

I get 18M training examples at maximum length 512, and I am also confused by the preprocessing. The number of training examples should be larger than the number of documents, yet the document count for Wiki + BooksCorpus is already larger than the number of training examples implied by the BERT paper.

Have you solved this? If so, could you please share how? Thank you very much. @Qinzhun

Qinzhun commented 5 years ago

Sorry, I haven't solved this problem. And when I tried a new framework, MXNet, which claims it can finish pretraining on 8 GPUs in 6.5 days, I found a problem similar to the one we discussed here: they also use fewer training examples than we generate, which may be about the same size as in the BERT paper. @DecstionBack

roomylee commented 5 years ago

Hi, @Qinzhun. I'm not sure, but I think dupe_factor, one of the hyperparameters in create_pretraining_data.py, causes that problem. It is the number of times the input data is duplicated (with different masks).

https://github.com/google-research/bert/blob/d66a146741588fb208450bde15aa7db143baaa69/create_pretraining_data.py#L53
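
To illustrate this: the whole corpus is walked dupe_factor times (the flag defaults to 10), each pass with a different random masking, so the instance count scales linearly with it. A toy, self-contained sketch of the effect (not the actual script code):

```python
import random

def make_instances(documents, dupe_factor, rng):
    """Toy illustration of the dupe_factor behaviour in create_pretraining_data.py:
    the corpus is walked dupe_factor times, so the instance count scales with it."""
    instances = []
    for _ in range(dupe_factor):
        for doc in documents:
            # Stand-in for create_instances_from_document() in the real script,
            # which emits differently-masked instances on each pass.
            instances.append({"doc": doc, "mask_seed": rng.random()})
    return instances

docs = [f"doc_{i}" for i in range(5_000)]
rng = random.Random(12345)
print(len(make_instances(docs, dupe_factor=1, rng=rng)))   # 5000
print(len(make_instances(docs, dupe_factor=10, rng=rng)))  # 50000
```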

songsuoyuan commented 4 years ago

I also ran the create_pretraining_data.py script. My input is the Wikipedia data (12 GB), 5,684,250 documents in total. First I split the dataset into 10 smaller files using the split command. Then for each file I ran the script with dupe_factor = 1 and max_seq_len = 128. In the end I got a training dataset with 33,236,250 instances.
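
In case it helps anyone cross-check their counts, the instances in the generated shards can be counted directly from the TFRecord files (a small sketch assuming TF 2.x eager mode; the glob pattern is just an example, point it at your own --output_file paths):

```python
import glob
import tensorflow as tf

# Count pre-training instances across all generated TFRecord shards.
files = sorted(glob.glob("pretraining_output/tf_examples_*.tfrecord"))
total = sum(sum(1 for _ in tf.data.TFRecordDataset(f)) for f in files)
print(f"{total} instances across {len(files)} shards")
```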

I also checked the total word count of my Wikipedia data using the wc command: the dataset contains 2,010,692,529 words across 110,819,655 lines. This is less than the 2,500M words reported in the BERT paper.

I am also confused by the definition of one epoch during the pre-training procedure. In my understanding, dupe_factor = 1 produces one epoch's worth of training data and dupe_factor = 5 produces five epochs' worth. Is this understanding correct?
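
My working assumption so far: dupe_factor only controls how many differently-masked copies of the corpus end up in the TFRecords, while the number of passes actually trained follows from num_train_steps and the batch size (as far as I can tell, run_pretraining.py repeats the input dataset during training). A rough sketch with the numbers above:

```python
# Rough relationship between the generated data and training passes
# (my interpretation; the instance count is from my run above).
instances_per_pass = 33_236_250   # produced with dupe_factor = 1, max_seq_len = 128
batch_size = 256
num_train_steps = 1_000_000

sequences_consumed = num_train_steps * batch_size
passes_over_corpus = sequences_consumed / instances_per_pass
print(f"{passes_over_corpus:.1f} passes over the corpus")   # ~7.7 with these numbers
```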

JF-D commented 4 years ago

I have a similar problem...

akanyaani commented 2 years ago

@Qinzhun Did you solve the problem? For one of my projects, I am trying to replicate the BERT pre-training data.