huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

train_new_from_iterator fails in non-space separated languages #1395

Closed. frotaur closed this issue 5 months ago

frotaur commented 7 months ago

I've been training tokenizers based on the 'gpt2' pretrained tokenizer, using train_new_from_iterator.

When training on languages such as Chinese or Japanese, memory explodes after 'pre-processing', and I suspect this is because words are not space-separated.

Am I missing a quick fix, or is this a deeper issue?
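For reference, the call pattern is roughly the following (a minimal sketch; the corpus loading and the vocabulary size are placeholders, not my exact script):

```python
from transformers import AutoTokenizer

# "gpt2" is a byte-level BPE tokenizer whose pre-tokenizer splits on spaces
# and punctuation, which is the part that matters for this issue.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def batch_iterator(texts, batch_size=1000):
    # Yield lists of raw documents; for Chinese/Japanese these are long runs
    # of characters with no spaces in between.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

texts = ["..."]  # e.g. a Chinese or Japanese corpus loaded elsewhere
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(texts), vocab_size=32000
)
```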

ArthurZucker commented 6 months ago

Yes, I think for languages like these you need to either manually split the documents / sentences, or use a pre-tokenizer like the one https://huggingface.co/facebook/xglm-564M does!
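Something along these lines should work for the manual splitting (a rough sketch; jieba is just one example of a third-party segmenter, any sentence or word splitter will do, and the vocab size is a placeholder):

```python
import jieba  # third-party Chinese word segmenter, used here only as an example
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def segmented_iterator(texts, batch_size=1000):
    # Insert spaces between the segmented words so the gpt2 byte-level
    # pre-tokenizer sees many short "words" instead of one huge unbroken string.
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        yield [" ".join(jieba.cut(doc)) for doc in batch]

texts = ["..."]  # your Chinese corpus, loaded however you like
new_tokenizer = old_tokenizer.train_new_from_iterator(
    segmented_iterator(texts), vocab_size=32000
)
```

Keep in mind the inserted spaces are only there to give the trainer word boundaries; whether you also segment text the same way at inference time is a separate choice.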

frotaur commented 6 months ago

But what is the reason for this? Not splitting on spaces should only add a number of pairs on the order of the number of words in the dataset, so why does it take so much more memory? It feels to me that pure BPE without pre-tokenization should still work, or am I missing something?

ArthurZucker commented 6 months ago

Initially, the BPE vocabulary is initialized with each character (or byte) as a separate subword, meaning every character in the training data is treated as a distinct subword unit. In English you would have a really small number of characters, but in Chinese this starts at around 10K. If you split the text beforehand it can be less than that, and above all the training will also require less RAM, since the size of the vocabulary grows dynamically during training as new subword units are created. In languages with explicit word boundaries, the vocabulary may stay relatively small compared to languages without such boundaries (e.g., Chinese).

Chinese has a vast vocabulary, with many thousands of characters, each of which can be combined with others to form compounds. This leads to a large number of possible subwords, which in turn requires more memory to store and explore.
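To make that concrete, here is a rough back-of-the-envelope illustration (hand-rolled counting, not the actual trainer code): with whitespace pre-tokenization, duplicate words collapse into frequency counts and pairs never cross word boundaries, while without it each document is roughly one giant "word" over a very large alphabet.

```python
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent-symbol pairs across a {word: frequency} table."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

english = "the cat sat on the mat and the cat sat again"
chinese = "猫坐在垫子上然后猫又坐在垫子上"  # stand-in for a much larger corpus

# With whitespace pre-tokenization: repeated words collapse into one entry,
# and no pair ever crosses a word boundary.
en_table = Counter(english.split())
print("EN alphabet:", len(set(english.replace(" ", ""))))
print("EN distinct pairs:", len(pair_counts(en_table)))

# Without pre-tokenization: the whole text is a single "word", nothing
# collapses, and at corpus scale the alphabet alone is ~10K characters.
zh_table = Counter([chinese])
print("ZH alphabet:", len(set(chinese)))
print("ZH distinct pairs:", len(pair_counts(zh_table)))
```

On a toy string the numbers are small either way; the point is that, roughly speaking, the per-word symbol sequences the trainer keeps around scale with the number of distinct "words", and without splitting that is essentially the whole corpus.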

Hope this makes sense?

frotaur commented 6 months ago

That makes sense as to why it's harder than alphabetic languages. However, within that explanation, I guess word boundaries wouldn't help much: adding boundaries still gives a number of pairs of the same order of magnitude, so I guess it would still take huge amounts of RAM.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.