huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Fixing alphabet not working #765

Closed: erksch closed this issue 3 years ago

erksch commented 3 years ago

Hey there!

I am trying to train a tokenizer with BertWordPieceTokenizer. I use an iterator that yields the texts and pass it to tokenizer.train_from_iterator.

After training the tokenizer I realized that the data contains many characters I do not need, like Chinese, Greek, or other special characters, and they end up in the vocabulary.

I tried fixing the alphabet like this:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

# data_gen() yields the raw training texts
alphabet = list("abcdefghijklmnopqrstuvwxyz.....")

tokenizer.train_from_iterator(
    iterator=data_gen(),
    vocab_size=30000,
    min_frequency=2,
    limit_alphabet=len(alphabet),
    initial_alphabet=alphabet,
)

However, this does not seem to affect the output vocabulary. Are limit_alphabet and initial_alphabet intended for this? If not, what are they for? Alternatively, I could remove unwanted characters directly in the generator (see the sketch below).
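
Something like this is what I had in mind for filtering in the generator (just a sketch; ALLOWED is a placeholder for whatever character set I actually want to keep):

ALLOWED = set("abcdefghijklmnopqrstuvwxyz .,")

def filtered_data_gen():
    # wrap the original generator and drop every character outside ALLOWED
    for text in data_gen():
        yield "".join(ch for ch in text if ch.lower() in ALLOWED)

tokenizer.train_from_iterator(iterator=filtered_data_gen(), vocab_size=30000)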

erksch commented 3 years ago

Actually, never mind! I was just mixing up some files; the alphabet limitation works fine!
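
For anyone running into the same thing, roughly this is the sanity check I used to confirm the restriction, assuming the trained tokenizer from above:

# collect every character that appears in the trained vocabulary
# and inspect it for unwanted scripts
chars = set("".join(tokenizer.get_vocab().keys()))
print(sorted(chars))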