Open anidiatm41 opened 4 years ago
FYI: the generated vocab file is 290 KB.
How big is the input file? It happened to me before that I had too many documents in one file and it just didn't write anything to the tfrecord file. Maybe try to split the input file into smaller ones and see if it works!
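A minimal sketch of that splitting step, in case it helps (the shard naming and shard size are just placeholders, not anything from the BERT repo; it assumes the corpus uses blank lines as document separators, which is the format create_pretraining_data.py expects):

```python
# Split a large pretraining corpus into smaller shards, cutting only at
# document boundaries (a blank line separates documents in BERT's input
# format). Shard file names here are illustrative.
import os

def split_corpus(in_path, out_dir, docs_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    shard, doc_count, out = 0, 0, None
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            if out is None:
                out = open(os.path.join(out_dir, f"shard_{shard:04d}.txt"),
                           "w", encoding="utf-8")
            out.write(line)
            if line.strip() == "":  # blank line marks the end of a document
                doc_count += 1
                if doc_count >= docs_per_shard:
                    out.close()
                    out, shard, doc_count = None, shard + 1, 0
    if out is not None:
        out.close()
```

Then point --input_file at the shard directory with a glob (the script accepts comma-separated patterns) and see whether the smaller inputs produce instances.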
I have the same problem. How do I solve it?
I am writing a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and am then running the code below:
!python create_pretraining_data.py \
  --input_file="/content/drive/My Drive/internet_archive_scifi_v3.txt" \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

(Note the quotes around the input path: "My Drive" contains a space, so an unquoted path is split into two arguments by the shell.)
I get this output:
INFO:tensorflow: Reading from input files
INFO:tensorflow: Writing to output files
INFO:tensorflow: /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances
Not sure why I always get 0 instances in tf_examples.tfrecord. What am I doing wrong?
I am using TF version 1.12.
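One thing worth checking before anything else: create_pretraining_data.py expects one sentence per line, with a blank line between documents, and it silently writes 0 instances if it can't recover any documents from the input. A quick sanity check of the corpus format might look like this (a sketch; `check_corpus` is a made-up helper, not part of the BERT repo):

```python
# Count the documents create_pretraining_data.py would see in an input
# file: non-empty lines are sentences, blank lines separate documents.
# A result of 0 (or 1 giant document) suggests a formatting problem.
def check_corpus(path):
    docs, lines_in_doc = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                lines_in_doc += 1
            elif lines_in_doc:
                docs, lines_in_doc = docs + 1, 0
    if lines_in_doc:       # last document may lack a trailing blank line
        docs += 1
    return docs
```

If this reports one huge document (e.g. the whole novel on a single line, or no blank lines at all), reformatting the corpus to the expected layout is the first thing I would try.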