Open MarwahAhmadHelaly opened 4 years ago
Hi. I am puzzled by the same question and want to share some thoughts.
I think that using the number of words may be misleading, because BERT is trained on tokens rather than words. A token may correspond to a whole word, to a part of a word, or to a single character. So we may need to multiply by an additional parameter, mean_tokens_per_word,
which I guess could be as large as 2. This could ruin the approximate equivalence mentioned in the deduction.
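To see how sensitive the step-count estimate is to this parameter, here is a quick check (hedged: the tokens-per-word values are assumed for illustration, not measured on the actual corpus; the corpus size and batch shape are the paper's reported numbers):

```python
# How the estimated number of training steps scales with the assumed
# average number of WordPiece tokens per word.
words_in_corpus = 3_300_000_000
tokens_per_batch = 256 * 512  # batch_size * max_seq_len
desired_epochs = 40

for mean_tokens_per_word in (1.0, 1.3, 2.0):  # assumed values
    tokens_in_corpus = words_in_corpus * mean_tokens_per_word
    steps = desired_epochs * tokens_in_corpus / tokens_per_batch
    print(f"tokens/word={mean_tokens_per_word}: ~{steps:,.0f} steps")
```

At 2 tokens per word the estimate doubles to roughly 2M steps, which is why the parameter matters for the approximate equivalence above.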
The number of instances consumed in one batch most probably refers to the duplicated instances, because the model works with the dataset in *.tfrecord format, which is the result of the duplication. It is not about just copying and pasting the examples, but rather about how many random samples we produce from the original Wikipedia + BookCorpus dataset.
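A minimal sketch of what I mean by "random samples rather than copies" (hedged: this is a toy stand-in for the duplication loop in create_pretraining_data.py, not the actual code; the real script applies fresh random masking and next-sentence sampling on each pass):

```python
import random

def create_instances(documents, dupe_factor, rng):
    # Each pass over the corpus produces instances with a fresh random
    # state, so dupe_factor = 5 yields five differently-sampled
    # "copies" per document rather than five identical ones.
    instances = []
    for _ in range(dupe_factor):
        for doc in documents:
            # The real script masks tokens and picks sentence pairs here;
            # we just record a random seed to show the copies differ.
            instances.append({"doc": doc, "sample_seed": rng.random()})
    return instances

rng = random.Random(12345)
instances = create_instances(["doc_a", "doc_b"], dupe_factor=5, rng=rng)
print(len(instances))  # 2 documents * dupe_factor 5 = 10 instances
```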
I will be glad to hear the author's opinion.
I have encoded the calculation of the number of epochs as part of my BERT-training automation attempt: https://github.com/stagedml/stagedml/blob/master/run/bert_pretrain/out/Report.md#appendix-a-number-of-pre-training-epoches
For your second question, I think one data sample in the batch refers to one single instance among the duplicates; i.e., in your example, if your batch size is 50, then all 500 instances will span 10 batches. I ran simple experiments on this; you can refer to my Colab notebook at https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/blob/master/misc/Albert_batch_behavior_deepdive.ipynb
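The arithmetic behind that claim, spelled out (hedged: the 100/5/50 numbers are the hypothetical values from the question, not anything measured):

```python
import math

original_instances = 100  # instances before duplication
dupe_factor = 5           # each instance is sampled 5 times
batch_size = 50

total_instances = original_instances * dupe_factor  # 500 duplicated instances
num_batches = math.ceil(total_instances / batch_size)
print(num_batches)  # 500 / 50 = 10 batches to consume the duplicated set once
```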
And for your question 1, after reading the create-pretraining-data script, I understand why they use that formula to deduce the number of epochs.
Every data sample in a batch is a concatenation of smaller sentences, filled up to max_seq_length (which is 512 for official BERT) the majority of the time (only with a small probability will it generate shorter sequences). Hence, they use that formula to approximate the number of epochs.
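A rough sketch of that packing behavior (hedged: this is a simplified stand-in for the chunking in create_pretraining_data.py; the real script also truncates to the target length and splits each chunk into segment A/B pairs, which is omitted here):

```python
import random

def pack_sentences(sentences, max_num_tokens, short_seq_prob=0.1, rng=None):
    # Accumulate sentences until the running token count reaches a
    # target length: max_num_tokens most of the time, but a shorter
    # random length with probability short_seq_prob.
    rng = rng or random.Random(0)

    def pick_target():
        if rng.random() < short_seq_prob:
            return rng.randint(2, max_num_tokens)
        return max_num_tokens

    target = pick_target()
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        current.append(sent)
        current_len += len(sent)
        if current_len >= target:
            chunks.append([tok for s in current for tok in s])
            current, current_len = [], 0
            target = pick_target()
    if current:  # leftover sentences form a final short chunk
        chunks.append([tok for s in current for tok in s])
    return chunks

sents = [["tok"] * 40 for _ in range(30)]  # 30 sentences of 40 tokens each
chunks = pack_sentences(sents, max_num_tokens=512, rng=random.Random(1))
print([len(c) for c in chunks])  # most chunks are near 512 tokens
```

Because almost every chunk is near max_seq_length, tokens-per-batch ≈ batch_size * max_seq_len is a reasonable approximation, which is what the formula in the question relies on.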
Hello,
I would like to confirm how the number of training steps, and hence the number of epochs, used in the paper for pretraining BERT is calculated.
From the paper, I deduced (kindly correct me if I am mistaken):
_training_steps = #_desired_epochs * (#words(not_sentences)_in_all_input_corpus / #tokens_per_batch)
where #tokens_per_batch = batch_size * max_seq_len
So using the numbers in the paper: 1,000,000 ≈ 40 * (3,300,000,000 / (256 * 512))
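Plugging the paper's numbers into the deduced formula to check the approximation (hedged: this assumes roughly one token per word, which the corpus's 3.3B-word figure does not guarantee):

```python
batch_size = 256
max_seq_len = 512
desired_epochs = 40
words_in_corpus = 3_300_000_000

tokens_per_batch = batch_size * max_seq_len  # 131,072 tokens per batch
training_steps = desired_epochs * words_in_corpus / tokens_per_batch
print(f"{training_steps:,.0f}")  # 1,007,080 -- close to the reported 1,000,000
```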
Question # 1) Is this deduction correct?
Also, to my understanding, batch_size represents the "number of training instances consumed in 1 batch".
If we assume we have 100 original instances and a dupe_factor of 5, giving 500 duplicated instances:
Question # 2) Is the "number of training instances consumed in 1 batch" concerned with original 100 or duped 500 instances?
Thank you!