huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Does `tokenizer.train_from_iterator` read all texts into memory? #1170

Closed Maxlinn closed 1 year ago

Maxlinn commented 1 year ago

Hi to the community!

Recently I've been training a BPE tokenizer on an existing large corpus (reading it all into memory is not feasible).

The corpus is not a common one-text-per-line file (it is, for example, several .tar archives consisting of text files), so I created a generator that reads the files inside, does the processing on the fly, and yields one string at a time. I then use `tokenizer.train_from_iterator` to train a BPE tokenizer, roughly as in the sketch below.
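For reference, a minimal sketch of the kind of setup I mean (the tar paths, the `clean_text` helper, and the trainer settings are made up for illustration, not my real pipeline):

```python
import tarfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def clean_text(raw: str) -> str:
    # stand-in for the real on-the-fly processing
    return raw.strip()


def corpus_iterator(tar_paths):
    """Yield one text at a time so only a small chunk is ever held in Python memory."""
    for tar_path in tar_paths:
        with tarfile.open(tar_path, "r") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                f = tar.extractfile(member)
                if f is None:
                    continue
                text = clean_text(f.read().decode("utf-8", errors="ignore"))
                if text:
                    yield text


tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=50000, special_tokens=["[UNK]"])

tokenizer.train_from_iterator(
    corpus_iterator(["corpus_00.tar", "corpus_01.tar"]),
    trainer=trainer,
)
tokenizer.save("bpe-zh.json")
```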

However, in the CLI the `Pre-processing sequences` counter grows and memory usage keeps growing until it hits the OS limit and the process receives a SIGKILL.

I am pretty sure the generator only holds small chunks of data at any time, so it could not possibly be causing the memory problem.

What should I do? Should I preprocess the corpus into one-text-per-line files and use `tokenizer.train` instead?

Many thanks!

tokenizers version: 0.13.2

Narsil commented 1 year ago

How large is the dataset?

The iterator itself should not hold onto the memory, but the BPE algorithm needs to keep around all the byte pairs in the original data, and that necessarily grows.

Keep in mind that for HUGE datasets you're also likely to hit a u32 overflow in the pair counts, leading to crashes or, worse, a silently bad tokenizer. (There is no plan to move to u64 for now, nor a heuristic to choose the better type.)

Maxlinn commented 1 year ago

Thanks for replying!

The dataset is a few hundred gigabytes, but the system memory is only 128 gigabytes. It seems I underestimated the memory needed to store the (2-gram, count) pairs. Maybe I should switch to a machine with more memory available.

It also occurred to me that another possible reason is that I'm processing a Chinese corpus without segmenting it (Chinese has no natural delimiter like spaces in English); I only split sentences around punctuation, which makes each pre-tokenized piece too long (maybe 10 to 20 Chinese characters) and blows up the number of possible 2-grams. I'll try to segment the text into pieces of around 3 to 4 Chinese characters, along the lines of the sketch below.
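Roughly what I have in mind for the segmentation step (using jieba here just as an example segmenter; the exact word boundaries depend on its dictionary):

```python
import jieba  # third-party Chinese word segmenter, one possible choice


def segment_line(line: str) -> str:
    """Put spaces between segmented words so each pre-tokenized piece stays short."""
    return " ".join(w for w in jieba.cut(line) if w.strip())


print(segment_line("机器学习是人工智能的一个分支"))
# e.g. "机器 学习 是 人工智能 的 一个 分支" (actual output depends on jieba's dictionary)
```

The generator would then yield `segment_line(text)` instead of the raw text, so the Whitespace pre-tokenizer sees short pieces.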

Many thanks for pointing out the u32 overflow problem; I'm a little worried about it. Is there any way to detect it? Or might other tools like sentencepiece not fall victim to it? (I hope this is a suitable question here.)

Narsil commented 1 year ago

> The dataset is a few hundred gigabytes

That's indeed huge, almost certain to overflow u32.

> I'm processing a Chinese corpus without segmenting it

100% that! BPE doesn't work well on large spans. I should mention that if you don't want to pre-tokenize, there's the Unigram algorithm, which should work better for those use cases. It has a limited token length (16 by default) but works reasonably well without pre-tokenization (don't quote me too much, I haven't used it extensively, nor trained many models on Chinese, but it's designed to work better and indeed does, memory-wise). There's a rough sketch of the training call below.

Now, for training Unigram we do have an implementation, but we could never fully reproduce sentencepiece (there are some architectural differences which make it quite hard to get exactly the same output; we never achieved 100% parity at training time). Occasional floating-point errors do not help either.
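Something along these lines should get you started (an untested sketch; adjust the vocab size and special tokens to your setup, and plug in your own text iterator):

```python
from tokenizers import Tokenizer, models, trainers

# Any iterator of strings works; a tiny in-memory list here just to show the call.
texts = ["这是一个例子", "another example sentence"]

tokenizer = Tokenizer(models.Unigram())  # deliberately no pre_tokenizer
trainer = trainers.UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>"],
    unk_token="<unk>",
    max_piece_length=16,  # the default token-length limit mentioned above
)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("unigram-zh.json")
```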

> Is there any way to detect it?

Unfortunately no. There are flags in the Rust compiler to keep overflow checks at runtime, but they would most likely slow everything down. Given the size of your data it's almost certain you're going to overflow (I know Chinese has a different distribution than English, but I'm still pretty sure it would overflow).

Given your problem, I would honestly consider:

Maxlinn commented 1 year ago

I'd like to extend my highest gratitude for your kind help, which really saved me!

After some experiments with Chinese segmentation, I found the segmented pieces are too fine-grained, which makes it impossible for BPE to learn higher-level words. For example, the Chinese word for "machine learning" gets torn into "machine" and "learning", and BPE will never make "machine learning" a single token, since merges cannot cross pre-tokenized word boundaries.

In the end I decided to sample the corpus and train BPE on less than one hundred gigabytes of data (roughly as in the sketch below), hoping it will cover most of the words without exceeding the u32 limit. Unigram is charming, but unfortunately I have to use BPE for compatibility reasons; I'll try it in the future.
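The sampling itself is just a thin wrapper over the generator (the 20% keep rate is only an example I'll tune against the memory/u32 budget):

```python
import random

random.seed(0)  # make the sample reproducible


def sample_texts(texts, keep_prob=0.2):
    """Stream through the texts and keep roughly `keep_prob` of them."""
    for text in texts:
        if random.random() < keep_prob:
            yield text


# usage with the generator from the earlier sketch:
# tokenizer.train_from_iterator(sample_texts(corpus_iterator(tar_paths)), trainer=trainer)
```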