kwonmha / bert-vocab-builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

Vocabulary builder for BERT

A modified, simplified version of text_encoder_build_subword.py and its dependencies from the tensor2tensor library, adapted so that its output fits Google Research's open-sourced BERT project.


Although Google released pre-trained BERT models and training scripts, they did not open-source the code used to generate a wordpiece (subword) vocabulary matching the vocab.txt in the released models.
As they note, the libraries they suggested are not compatible with BERT's tokenization.py.
So I modified text_encoder_build_subword.py of the tensor2tensor library, one of the options Google suggested, to generate a wordpiece vocabulary.
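
For illustration, the gap between the two formats looks roughly like this (a sketch based on the two libraries' documented vocabulary formats, not literal output of this repository):

# tensor2tensor's SubwordTextEncoder stores one quoted subtoken per line,
# using a trailing underscore to mark the end of a word:
t2t_style = ["'the_'", "'tok'", "'en_'"]

# BERT's tokenization.py expects bare tokens plus special tokens, with a
# "##" prefix marking word-internal pieces:
bert_style = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "tok", "##en"]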

Modifications

Requirements

The environment this project was developed in consists of:

Basic usage

python subword_builder.py \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
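
Once the vocabulary is built, a quick way to check compatibility is to load it with BERT's own tokenizer. A minimal sketch, assuming tokenization.py from the google-research/bert repository is importable and the vocabulary was written to vocab.txt (both names are assumptions, adjust to your setup):

# Sanity check: verify the generated vocabulary loads with BERT's tokenizer.
import tokenization  # from https://github.com/google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt",  # output of subword_builder.py (assumed name)
    do_lower_case=True)      # match the casing of the corpus you built from

tokens = tokenizer.tokenize("Vocabulary builder for BERT")
print(tokens)  # prints subwords drawn from the generated vocabulary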