huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Fixing alphabet not working #765

Closed: erksch closed this issue 3 years ago

erksch commented 3 years ago

Hey there!

I am trying to train a tokenizer with BertWordPieceTokenizer. I use an iterator that yields the texts and pass it to tokenizer.train_from_iterator.

After training the tokenizer I realized that the data contains many characters I do not need, like Chinese, Greek, or other special characters, and they end up in the vocabulary.

I tried fixing the alphabet like this:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

# data_gen() yields the raw training texts
alphabet = list("abcdefghijklmnopqrstuvwxyz.....")

tokenizer.train_from_iterator(
    iterator=data_gen(),
    vocab_size=30000,
    min_frequency=2,
    limit_alphabet=len(alphabet),
    initial_alphabet=alphabet,
)

However, this does not seem to affect the output vocabulary. Are limit_alphabet and initial_alphabet intended for this? If not, what are they for? Alternatively, I could remove unwanted characters directly in the generator (see the sketch below).
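
Something like this is what I had in mind for filtering in the generator (just a sketch; ALLOWED is a placeholder for whatever character set I actually want to keep):

ALLOWED = set("abcdefghijklmnopqrstuvwxyz .,")

def filtered_data_gen():
    # wrap the original generator and drop every character outside ALLOWED
    for text in data_gen():
        yield "".join(ch for ch in text if ch.lower() in ALLOWED)

tokenizer.train_from_iterator(iterator=filtered_data_gen(), vocab_size=30000)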

erksch commented 3 years ago

Actually, never mind! I was just mixing up some files; the alphabet limitation works fine!
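
For anyone running into the same thing, roughly this is the sanity check I used to confirm the restriction, assuming the trained tokenizer from above:

# collect every character that appears in the trained vocabulary
# and inspect it for unwanted scripts
chars = set("".join(tokenizer.get_vocab().keys()))
print(sorted(chars))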