google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

create_pretraining_data.py is writing 0 records to tf_examples.tfrecord #1147

Open · anidiatm41 opened this issue 4 years ago

anidiatm41 commented 4 years ago

I am training a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and am running the command below:

```shell
!python create_pretraining_data.py \
  --input_file="/content/drive/My Drive/internet_archive_scifi_v3.txt" \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```
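Two things are worth checking with this invocation. First, the `--input_file` path contains a space (`My Drive`), so it must be quoted as shown above; unquoted, the shell splits the argument in two and the script's file glob matches no files, which would read zero documents and write zero instances. Second, per the BERT README, the input must be plain text with one sentence per line and a blank line between documents. A minimal sanity check of the corpus format (the path is the one from the command; this does not involve the script itself):

```python
# Hedged check: count document delimiters in the corpus. The path is the one
# assumed from the command above.
path = "/content/drive/My Drive/internet_archive_scifi_v3.txt"

num_lines = 0
num_blank = 0  # blank lines are document boundaries for create_pretraining_data.py
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        num_lines += 1
        if not line.strip():
            num_blank += 1

print("total lines:", num_lines)
print("blank lines (document delimiters):", num_blank)
# If num_blank is 0, the whole corpus is read as a single document, which is
# worth ruling out before blaming anything else.
```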

Getting this output:

```
INFO:tensorflow: Reading from input files
INFO:tensorflow: Writing to output files
INFO:tensorflow:   /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances
```

Not sure why I am always getting 0 instances in tf_examples.tfrecord. What am I doing wrong?

I am using TF version 1.12
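To confirm the output file really is empty (rather than the count being misreported), a minimal record count using the TF 1.x API, matching the TF 1.12 mentioned above:

```python
import tensorflow as tf

# Path assumed from the command above.
output_file = "/content/sample_data/tf_examples.tfrecord"

# tf.python_io.tf_record_iterator yields one raw record per serialized example.
count = sum(1 for _ in tf.python_io.tf_record_iterator(output_file))
print("records in %s: %d" % (output_file, count))
```

Note that the log above lists no input files after "Reading from input files", which is consistent with the input glob matching nothing.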

anidiatm41 commented 4 years ago

FYI: the generated vocab file is 290 KB.
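One hedged sanity check on that vocab (path assumed from the command above): the script's tokenizer and masking logic rely on special tokens such as [UNK], [CLS], [SEP], and [MASK], so they should appear in the file:

```python
# Sketch only: verify the generated WordPiece vocab contains the special
# tokens create_pretraining_data.py expects, one token per line.
vocab_path = "/content/sample_data/sifi_13sep-vocab.txt"

with open(vocab_path, "r", encoding="utf-8") as f:
    vocab = set(line.strip() for line in f)

for token in ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, "present" if token in vocab else "MISSING")
```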

blueberry-cake commented 4 years ago

How big is the input file? It happened to me before that I had too many documents in one file, and the script just didn't write anything to the tfrecord file. Maybe try splitting the input file into smaller ones and see if it works!
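A minimal sketch of that suggestion, assuming the paths from the original command; it splits only at blank lines so documents stay intact, and docs_per_shard is an arbitrary illustration:

```python
def split_corpus(path, out_pattern, docs_per_shard=1000):
    """Write shards named out_pattern % 0, out_pattern % 1, ...,
    cutting only at blank lines (document boundaries)."""
    shard_idx = 0
    docs_in_shard = 0
    out = open(out_pattern % shard_idx, "w", encoding="utf-8")
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            out.write(line)
            if not line.strip():  # blank line ends a document
                docs_in_shard += 1
                if docs_in_shard >= docs_per_shard:
                    out.close()
                    shard_idx += 1
                    docs_in_shard = 0
                    out = open(out_pattern % shard_idx, "w", encoding="utf-8")
    out.close()

split_corpus("/content/drive/My Drive/internet_archive_scifi_v3.txt",
             "/content/sample_data/corpus_shard_%03d.txt")
```

If I read the script correctly, --input_file is split on commas and each pattern is globbed, so the shards can then be passed as --input_file=/content/sample_data/corpus_shard_*.txt.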

oniondai commented 3 years ago

I have the same problem. How can I solve it?