Hey there!
I am trying to train a tokenizer with `BertWordPieceTokenizer`. I use an iterator that yields the text and call `tokenizer.train_from_iterator`. After training the tokenizer, I realized that there are many unnecessary characters in the data, such as Chinese, Greek, or very special characters that I do not need.

I tried fixing the alphabet like this:

However, it does not affect the output. Are `limit_alphabet` and `initial_alphabet` intended for this? If not, what are their purposes? I could also remove the unwanted characters in the generator.
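For the last option (removing unwanted characters in the generator), this is a minimal sketch of what I have in mind, assuming the corpus is just an iterable of strings; the allowed-character set and the helper name `filtered_texts` are only illustrative:

```python
import re

# Keep only basic Latin letters, digits, whitespace, and common punctuation;
# everything else (CJK, Greek, other symbols, ...) is stripped out.
_UNWANTED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]+")

def filtered_texts(texts):
    """Yield each text with the unwanted characters removed."""
    for text in texts:
        yield _UNWANTED.sub("", text)

corpus = ["hello 世界", "αβγ test!", "plain ascii"]
print(list(filtered_texts(corpus)))  # ['hello ', ' test!', 'plain ascii']
```

The filtered generator could then be passed to `tokenizer.train_from_iterator` in place of the raw one, so the unwanted characters never reach the trainer at all.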