google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

BERT pretraining num_train_steps questions #1025

Open MarwahAhmadHelaly opened 4 years ago

MarwahAhmadHelaly commented 4 years ago

Hello,

I would like to confirm how the number of training steps and hence the number of epochs used in the paper for pretraining BERT is calculated.

From the paper, I deduced (kindly correct me if I am mistaken):

`#training_steps = #desired_epochs * (#words_in_all_input_corpus / #tokens_per_batch)`

where `#tokens_per_batch = batch_size * max_seq_len`, and `#words` counts words (not sentences) in the whole input corpus.

So using the numbers in the paper: 1,000,000 ≈ 40 * (3,300,000,000 / (256 * 512))
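As a quick sanity check of that arithmetic in Python (the 3.3B-word corpus size, batch size 256 and max_seq_len 512 are the paper's numbers; the word count only approximates the token count):

```python
# Sanity check of the deduction above, using the paper's numbers.
words_in_corpus = 3_300_000_000   # ~3.3B words (BooksCorpus + English Wikipedia)
desired_epochs = 40
batch_size = 256
max_seq_len = 512

tokens_per_batch = batch_size * max_seq_len            # 131,072
train_steps = desired_epochs * words_in_corpus / tokens_per_batch
print(f"{train_steps:,.0f} steps")                     # ~1,007,080, i.e. roughly 1,000,000
```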

Question # 1) Is this deduction correct?


Also, to my understanding, batch_size represents the "number of training instances consumed in 1 batch".

If we assume we have 100 original instances and dupe_factor = 5, so that the generated tfrecords contain 500 duplicated instances:

Question # 2) Is the "number of training instances consumed in 1 batch" concerned with the original 100 instances or with the 500 duplicated ones?

Thank you!

sergei-mironov commented 4 years ago

Hi. I am puzzled by the same question and want to share some thoughts.

  1. I think that using the number of words may be misleading, because BERT is trained on tokens rather than words. A token may correspond to a whole word, to a part of a word, or to a single character. So we may need to multiply by an additional mean_tokens_per_word factor, which I guess could be as large as 2. This could ruin the approximate equivalence in the deduction above (a rough way to measure that ratio is sketched just after this list).

  2. The number of instances consumed in 1 batch most probably refers to the duplicated instances, because the model works with the dataset in *.tfrecord format, which is the result of the duplication. It is not just copying-and-pasting the examples; what matters is how many random samples we produce from the original wiki+BookCorpus dataset (a simplified paraphrase of that loop is at the end of this comment).
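To put a rough number on point 1, the tokens-per-word ratio can be measured with the repo's WordPiece tokenizer. A minimal sketch, assuming tokenization.py from this repo is importable and that "vocab.txt" is a placeholder path to the released vocabulary file:

```python
import tokenization  # tokenization.py from this repo

# Estimate mean_tokens_per_word on a sample of the pre-training text.
# "vocab.txt" is a placeholder path for the released WordPiece vocabulary.
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

sample_text = "BERT is pre-trained on BooksCorpus and English Wikipedia."
words = sample_text.split()
tokens = tokenizer.tokenize(sample_text)
print(len(tokens) / len(words))  # mean tokens per word on this sample
```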

I will be glad to see the author's opinion.

I have encoded the calculation of the number of epochs as part of my BERT pre-training automation attempt: https://github.com/stagedml/stagedml/blob/master/run/bert_pretrain/out/Report.md#appendix-a-number-of-pre-training-epoches
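For reference, the duplication behaviour from point 2 looks roughly like this. This is a simplified paraphrase of create_pretraining_data.py, not the exact code, with create_instances_from_document passed in as a stand-in for the repo's per-document sampling function:

```python
import random

def create_training_instances(all_documents, dupe_factor, create_instances_from_document, rng):
    """Simplified paraphrase: each of the dupe_factor passes over the documents
    draws new random sequence boundaries and masked positions, so the output
    contains dupe_factor differently-sampled instances per document."""
    instances = []
    for _ in range(dupe_factor):
        for document_index in range(len(all_documents)):
            instances.extend(
                create_instances_from_document(all_documents, document_index, rng))
    rng.shuffle(instances)
    return instances

# e.g. rng = random.Random(12345); create_training_instances(docs, 5, sample_fn, rng)
```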

LydiaXiaohongLi commented 4 years ago

I think for your second question, one data sample in the batch refers to one single instance among the duplicates, i.e. in your example, if your batch size is 50, then all 500 instances will span 10 batches. I did simple experiments on this; you can refer to my Colab notebook at https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/blob/master/misc/Albert_batch_behavior_deepdive.ipynb
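In other words, with the hypothetical numbers from the question (100 original instances, dupe_factor 5, batch size 50):

```python
# Hypothetical numbers from the question above.
original_instances = 100
dupe_factor = 5
batch_size = 50

duplicated_instances = original_instances * dupe_factor  # 500 instances in the tfrecords
batches_per_pass = duplicated_instances // batch_size    # 10 batches to consume each once
print(duplicated_instances, batches_per_pass)            # 500 10
```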

LydiaXiaohongLi commented 4 years ago

And for your question 1, after reading the pretraining data creation script (create_pretraining_data.py), I understand why they use that formula to deduce the number of epochs.

Every data sample in a batch is a concatenation of shorter sentences, packed until it reaches max_seq_length (which is 512 for the official BERT) most of the time; only with a small probability will it generate shorter sequences. Hence, they use that formula to approximate the number of epochs.
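A minimal sketch of that packing behaviour, assuming the short_seq_prob logic from create_pretraining_data.py (greedy packing simplified; pick_target_length and pack_sentences are illustrative names, not the repo's functions):

```python
import random

def pick_target_length(max_seq_length, short_seq_prob=0.1, rng=random):
    """Usually fill close to max_seq_length; with small probability target a shorter sequence."""
    max_num_tokens = max_seq_length - 3           # room for [CLS], [SEP], [SEP]
    if rng.random() < short_seq_prob:
        return rng.randint(2, max_num_tokens)     # occasional short sequence
    return max_num_tokens                         # usual case: (nearly) full sequence

def pack_sentences(tokenized_sentences, max_seq_length, rng=random):
    """Greedily concatenate tokenized sentences until the target length is reached."""
    target = pick_target_length(max_seq_length, rng=rng)
    packed = []
    for sent_tokens in tokenized_sentences:
        packed.extend(sent_tokens)
        if len(packed) >= target:
            break
    return packed[:target]
```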