Closed: kaisugi closed this issue 8 months ago
Hi @HelloRusk ,
I fully understand your frustration if you spent time on this.
As you know, there are some languages that consist of a huge number of characters (or alphabets) in the world, such as Chinese and Japanese.
I am very aware of this and try to raise awareness of it (as well as the fact that those languages are not space-separated, which is another assumption many people make when assessing tokenizers).
limit_alphabet can actually be set to None if you don't want it, and that is the default for BpeTrainer, which is the recommended way to train a new tokenizer: https://huggingface.co/docs/tokenizers/quicktour
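For reference, a minimal sketch along the lines of the quicktour, where BpeTrainer is used without setting limit_alphabet, so the full alphabet found in the training data is kept (the corpus file name is just a placeholder):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer as in the quicktour.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# limit_alphabet is not set here, so the trainer keeps every character
# that appears in the training data (the BpeTrainer default).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# "corpus.txt" is a placeholder for your own training file(s).
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```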
BertWordPieceTokenizer does exist, but it mainly aims to reproduce Bert training (which is English only afaik). It adds limit_alphabet for this reason, actually: without it, on data that includes international characters you would get bad English tokenization because all the tokens would be eaten up by Chinese/Japanese/Thai and other Unicode-intensive languages. Casting those characters to [UNK] is actually the desired behavior.
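As an illustration (a sketch, not an official recommendation), this is roughly how you could raise limit_alphabet when training BertWordPieceTokenizer on Japanese data; the file path and the value 6000 are placeholder assumptions:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

# With the default limit_alphabet=1000, most Kanji fall outside the kept
# alphabet and end up as [UNK]. Raising it (6000 is an arbitrary example
# value here) keeps the characters actually present in the corpus.
tokenizer.train(
    files=["japanese_corpus.txt"],  # placeholder path
    vocab_size=30000,
    limit_alphabet=6000,
)
```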
That being said, I can see that you spent time and effort, and lost a bit of time on this. Maybe you could share where you got the documentation and/or tutorial you used to train your tokenizer? I think a caveat should be mentioned that different languages require different attention and that not all defaults work for all languages.
Would that help ?
Thank you so much for your thorough answer!
Your explanation makes it clear why limit_alphabet is needed for (English) BERT. Surely, when training it, including all the international characters wouldn't work!
As a matter of fact, the Japanese-speaking world does not share the best practices for creating language models in the first place. So, I was somehow trying to use BertWordPieceTokenizer without referring to any article in particular.
I would like to learn from my mistake and write a Japanese article about this parameter so that others will not make the same mistake in the future. Still, I personally think it would be more user-friendly to have some kind of warning output when the total number of characters exceeds the value.
Anyway, thanks!
Still, I personally think it would be more user-friendly to have some kind of warning output when the total number of characters exceeds the value.
I see! We'll think about it.
As a matter of fact, the Japanese-speaking world does not share the best practices for creating language models in the first place. So, I was somehow trying to use BertWordPieceTokenizer without referring to any article in particular.
Please share it whenever you're done! I would love more resources to share with people whenever issues with non-European languages are not sufficiently taken into account. Thinking that space is a universal separator is the biggest culprit in my book, but I know there are more challenges associated with Japanese for instance (I think jieba/jieba-rs is used as a pre_tokenizer for instance), but as I don't practice those languages daily, I don't necessarily have all the differences with English-heavy tokenizers at the top of my head.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
When I trained my own tokenizer using BertWordPieceTokenizer in Japanese, I found that [UNK] tokens frequently appear after the tokenization. It took me some time to conclude that this was due to the default value of limit_alphabet.
As you know, there are some languages that consist of a huge number of characters (or alphabets) in the world, such as Chinese and Japanese. In fact, more than 2,000 Kanji are designated as "daily-use Kanji (常用漢字)" in Japanese. Considering this, I think the default parameter limit_alphabet=1000 is not an appropriate number.
My suggestions:
A: eliminate the default value of the parameter in BertWordPieceTokenizer and some other implementation classes
B: set the default value to a larger one (e.g., 10,000)
C: issue a visible warning log when the total number of characters in the corpus exceeds the value (a rough sketch of the kind of check I mean is below)
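To make suggestion C concrete, here is a rough user-side sketch, not library code: count the distinct characters in the corpus and warn before training if they exceed the limit_alphabet that will be used (the helper name and file path are hypothetical):

```python
import warnings

def warn_if_alphabet_truncated(corpus_files, limit_alphabet=1000):
    """Hypothetical helper: warn when the corpus alphabet exceeds limit_alphabet."""
    chars = set()
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                chars.update(line)
    if len(chars) > limit_alphabet:
        warnings.warn(
            f"Corpus contains {len(chars)} distinct characters but "
            f"limit_alphabet={limit_alphabet}; many characters will be "
            f"mapped to [UNK]. Consider raising limit_alphabet."
        )
    return len(chars)

# Example usage with a placeholder file name:
# warn_if_alphabet_truncated(["japanese_corpus.txt"], limit_alphabet=1000)
```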