Hi all,
I'm trying to generate the pretraining corpus for BERT with pregenerate_training_data.py. The BERT paper reports roughly 6M+ instances (segment A + segment B, up to 512 tokens), but I get 18M instances, almost three times as many as BERT uses. Does anyone have an idea why, and do I need to preprocess Wikipedia and BookCorpus first before generating the training instances? Thanks very much in advance!
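For reference, here is a rough back-of-the-envelope estimate of the instance count (a minimal sketch, not an answer: the ~3.3B-word corpus size is from the BERT paper, and I'm assuming the script re-packs the full corpus once per requested epoch, so all numbers are illustrative only):

```python
# Rough estimate of how many pretraining instances a corpus yields.
# Assumptions (not from the original post): ~3.3B words as in the BERT paper,
# each instance packed close to max_seq_len tokens, and one full pass over
# the corpus generated per epoch requested from pregenerate_training_data.py.
corpus_tokens = 3.3e9        # Wikipedia (~2.5B) + BookCorpus (~0.8B) words
max_seq_len = 512            # max tokens per instance (segment A + segment B)
epochs_to_generate = 3       # hypothetical value; each epoch re-packs the corpus

instances_per_epoch = corpus_tokens / max_seq_len
total_instances = instances_per_epoch * epochs_to_generate

print(f"~{instances_per_epoch / 1e6:.1f}M instances per epoch")  # ~6.4M
print(f"~{total_instances / 1e6:.1f}M instances total")          # ~19.3M
```

Under these assumptions, a per-epoch count lands near the paper's ~6M figure, and a 3x larger total could simply reflect generating data for multiple epochs or using a shorter max_seq_len.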