kwonmha / bert-vocab-builder

Builds a WordPiece (subword) vocabulary compatible with Google Research's BERT

is there a format for corpus_filepattern? #12

Closed YuBeomGon closed 4 years ago

YuBeomGon commented 4 years ago

Hi, thank you for sharing this. I am trying to build a vocab.txt for the IMDB movie review dataset with the command below.

python3 subword_builder.py --corpus_filepattern IMDB_review.txt --output_filename vocab.txt --min_count 30000

WARNING:tensorflow:From subword_builder.py:81: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W0304 18:26:35.470829 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:133: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

['./IMDB_review.txt']

WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

W0304 18:26:35.492865 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/tokenizer.py:138: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

19.23373532295227 for reading read file : ./IMDB_review.txt read all files

WARNING:tensorflow:From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0304 18:26:54.772613 140030738089792 module_wrapper.py:139] From /home/beomgon2/albert/bert-vocab-builder/text_encoder.py:588: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Iteration 0
I0304 18:26:54.772828 140030738089792 text_encoder.py:588] Iteration 0
INFO:tensorflow:vocab_size = 668
I0304 18:26:59.560518 140030738089792 text_encoder.py:660] vocab_size = 668
INFO:tensorflow:Iteration 1
I0304 18:26:59.560930 140030738089792 text_encoder.py:588] Iteration 1
INFO:tensorflow:vocab_size = 378
I0304 18:27:02.865697 140030738089792 text_encoder.py:660] vocab_size = 378
INFO:tensorflow:Iteration 2
I0304 18:27:02.866119 140030738089792 text_encoder.py:588] Iteration 2
INFO:tensorflow:vocab_size = 403
I0304 18:27:06.409686 140030738089792 text_encoder.py:660] vocab_size = 403
INFO:tensorflow:Iteration 3
I0304 18:27:06.409908 140030738089792 text_encoder.py:588] Iteration 3
INFO:tensorflow:vocab_size = 397
I0304 18:27:10.208530 140030738089792 text_encoder.py:660] vocab_size = 397
INFO:tensorflow:Iteration 4
I0304 18:27:10.208930 140030738089792 text_encoder.py:588] Iteration 4
INFO:tensorflow:vocab_size = 399
I0304 18:27:13.905530 140030738089792 text_encoder.py:660] vocab_size = 399
total vocab size : 456, 19.1799635887146 seconds elapsed
INFO:tensorflow:vocab_size = 456
I0304 18:27:13.912348 140030738089792 text_encoder.py:686] vocab_size = 456

But the resulting vocab size is very small. What is wrong?

IMDB_review.txt I thought this was wonderful way to spend time on too hot summer weekend sitting in the air conditioned theater and watching light hearted comedy The plot is simplistic but the dialogue is witty and the characters are likable even the well bread suspected serial killer While some may be disappointed when they realize this is not Match Point Risk Addiction thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love This was the most d laughed at one of Woody comedies in years dare say decade While ve never been impressed with Scarlet Johanson in this she managed to tone down her sexy image and jumped right into average but spirited young woman This may not be the crown jewel of his career but it was wittier than Devil Wears Prada and more interesting than Superman great comedy to go see with friends . Basically there a family where little boy Jake thinks there a zombie in his closet his parents are fighting all the time This movie is slower than soap opera and suddenly Jake decides to become Rambo and kill the zombie OK first of all when you re going to make film you must Decide if its thriller or drama As drama the movie is watchable Parents are divorcing arguing like in real life And then we have Jake with his closet which totally ruins all the film expected to see BOOGEYMAN similar movie and instead watched drama with some meaningless thriller spots out of just for the well playing parents descent dialogs As for the shots with Jake just ignore them .

My TensorFlow version is 1.15.

YuBeomGon commented 4 years ago

I found the reason: I had misunderstood min_count to be the minimum size of the vocabulary. Thank you.
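For later readers: the log above is consistent with min_count acting as a frequency threshold rather than a size target, so a large value such as 30000 leaves only a few hundred entries. The toy sketch below illustrates that effect only. It is not the repository's algorithm; the helper names (count_substrings, prune_by_min_count), the 4-character substring limit, and the tiny corpus are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_substrings(words, max_len=4):
    """Count every substring of up to max_len characters across all words."""
    counts = Counter()
    for word in words:
        for start in range(len(word)):
            stop = min(len(word), start + max_len)
            for end in range(start + 1, stop + 1):
                counts[word[start:end]] += 1
    return counts


def prune_by_min_count(counts, min_count):
    """Always keep single characters; keep longer pieces only if they
    occur at least min_count times (a hypothetical pruning rule)."""
    return {s for s, c in counts.items() if len(s) == 1 or c >= min_count}


# Tiny made-up corpus, loosely based on the review text quoted in this issue.
corpus = (
    "the plot is simplistic but the dialogue is witty "
    "and the characters are likable"
).split()

counts = count_substrings(corpus)
for threshold in (1, 2, 5):
    vocab = prune_by_min_count(counts, threshold)
    print(f"min_count={threshold}: vocab size = {len(vocab)}")
```

In this sketch, raising the threshold can only shrink the vocabulary, and a threshold far above any count in the corpus leaves just the single characters. If a larger vocabulary is wanted, lowering --min_count is the direction to try.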

kwonmha commented 4 years ago

Closed as it seems to have been solved.