huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

`limit_alphabet=1000` is unreasonable in some languages #1004

Closed kaisugi closed 8 months ago

kaisugi commented 2 years ago

When I trained my own tokenizer on Japanese text using BertWordPieceTokenizer, I found that [UNK] tokens frequently appeared in the output. It took me some time to conclude that this was due to the default value of limit_alphabet.

As you know, some languages, such as Chinese and Japanese, use a huge number of distinct characters. In Japanese alone, more than 2,000 Kanji are designated as "daily-use Kanji" (常用漢字). Considering this, I think the default limit_alphabet=1000 is not an appropriate value.
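For reference, here is a minimal sketch of the kind of training call where this bites (the corpus path is just a placeholder); `train()` defaults to `limit_alphabet=1000`:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

# train() uses limit_alphabet=1000 by default, so only the 1,000 most frequent
# characters of the corpus can ever enter the vocabulary.
tokenizer.train(files=["japanese_corpus.txt"], vocab_size=30000)

# Characters that didn't make the 1,000-character cut come out as [UNK].
print(tokenizer.encode("常用漢字は2,000字以上あります").tokens)
```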

My suggestions:

- A: eliminate the default value of the parameter in BertWordPieceTokenizer and the other implementation classes
- B: set the default to a larger value (e.g., 10,000)
- C: issue a visible warning log when the total number of characters in the corpus exceeds the value

Narsil commented 2 years ago

Hi @HelloRusk,

I fully understand your frustration if you spent time on this.

> As you know, some languages, such as Chinese and Japanese, use a huge number of distinct characters.

I am extremely aware of this and try to alert people about it whenever I can (along with the fact that those languages are not space-separated, which is another assumption many people make when assessing tokenizers).

limit_alphabet can actually be set to None if you don't want it, and that is the default for BpeTrainer, which is the recommended way to train a new tokenizer: https://huggingface.co/docs/tokenizers/quicktour
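The quicktour flow looks roughly like this (the file path is a placeholder); note that `BpeTrainer` does not restrict the alphabet unless you pass `limit_alphabet` yourself:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# The Whitespace pre-tokenizer follows the quicktour; for languages without
# spaces you may want a different pre-tokenization strategy.
tokenizer.pre_tokenizer = Whitespace()

# limit_alphabet is left unset here, so the full character inventory of the
# corpus can make it into the vocabulary.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["japanese_corpus.txt"], trainer=trainer)
```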

BertWordPieceTokenizer does exist, but it mainly aims to reproduce the original BERT training (which is English only, afaik). That is precisely why it adds limit_alphabet: without it, on data that includes international characters, you would get bad English tokenization because all the vocabulary slots would be eaten up by Chinese/Japanese/Thai and other Unicode-intensive languages. Casting those characters to [UNK] is actually the desired behavior there.
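If you do want to keep `BertWordPieceTokenizer` on a CJK corpus, you can also just pass a larger value explicitly when training; a rough sketch (the path and the exact numbers are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

# Raise limit_alphabet explicitly so the Kanji/Kana inventory is not truncated.
tokenizer.train(
    files=["japanese_corpus.txt"],
    vocab_size=30000,
    limit_alphabet=6000,
)
```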

That being said, I can see that you spent time and effort and lost a bit of time on this. Could you maybe share where you got the documentation or the tutorial you used to train your tokenizer? I think a caveat should be mentioned there that different languages require different attention and that not all defaults work for all languages.

Would that help?

kaisugi commented 2 years ago

Thank you so much for your thorough answer!

Your explanation makes it clear why limit_alphabet is needed for (English) BERT. Indeed, including all the international characters when training it wouldn't work!

As a matter of fact, best practices for creating language models are not really shared in the Japanese-speaking community in the first place, so I was trying to use BertWordPieceTokenizer without referring to any particular article.

I would like to learn from my mistake and write a Japanese article about this parameter so that others will not make the same mistake in the future. Still, I personally think it would be more user-friendly to have some kind of warning output when the total number of characters in the corpus exceeds the value.
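To make it concrete, the check I have in mind could be as simple as the sketch below (the helper is purely illustrative, not an existing API):

```python
# Hypothetical pre-check: warn when the corpus contains more distinct
# characters than limit_alphabet will keep.
def check_alphabet_size(files, limit_alphabet=1000):
    chars = set()
    for path in files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                chars.update(line)
    if len(chars) > limit_alphabet:
        print(
            f"Warning: the corpus contains {len(chars)} distinct characters, "
            f"but limit_alphabet={limit_alphabet}; the rest will become [UNK]."
        )
    return len(chars)
```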

Anyway, thanks!

Narsil commented 2 years ago

> Still, I personally think it would be more user-friendly to have some kind of warning output when the total number of characters in the corpus exceeds the value.

I see! We'll think about it.

> As a matter of fact, best practices for creating language models are not really shared in the Japanese-speaking community in the first place, so I was trying to use BertWordPieceTokenizer without referring to any particular article.

Please share it whenever you're done! I would love to have more resources to share with people whenever issues with non-European languages are not sufficiently taken into account. Assuming that space is a universal separator is the biggest culprit in my book, but I know there are more challenges with Japanese, for instance (I think jieba/jieba-rs is used as a pre_tokenizer, for instance); as I don't practice those languages daily, I don't have all the differences with English-heavy tokenizers at the top of my head.
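For what it's worth, by using a word segmenter as a pre-tokenization step I mean roughly the pattern below (purely illustrative; jieba is the Chinese segmenter, and the file paths are placeholders): pre-segment the corpus into space-joined words so that a whitespace-based pre-tokenizer can be used downstream.

```python
# Illustrative only: pre-segment text that has no spaces with an external
# word segmenter (here jieba, for Chinese), writing space-joined words so a
# whitespace pre-tokenizer can be applied afterwards.
import jieba

def presegment(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(jieba.cut(line.strip())) + "\n")
```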

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.