huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Size limit for input text file #159

Closed manueltonneau closed 4 years ago

manueltonneau commented 4 years ago

Hi all,

Thanks for this great contribution :)

I was using the module to build a WordPiece vocab, using a very big txt file as input (115GB). Loading the data and tokenizing the words worked fine. While the pair counting was going on, I got this error: `memory allocation of 150994960 bytes failed` followed by `Aborted`. Any idea why this could happen?
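
For reference, the training setup looks roughly like this (a minimal sketch; the file path and parameters are placeholders, and the exact call may differ depending on the tokenizers version):

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocab from one large text file.
# "corpus.txt" and the parameters below are placeholders, not my exact setup.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
)

# Writes vocab.txt to the given directory.
tokenizer.save_model("wordpiece_out")
```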

Thanks in advance!

Narsil commented 4 years ago

Well, your system just ran out of memory. 115GB is pretty large for a text file.

Does your text file contain real text (with spaces and sentences)? It will use less memory if your text file is quite redundant in terms of words/pairs. Just guessing here, I'm not from Hugging Face.
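
If you want to sanity-check that, a quick-and-dirty way is to count distinct words in a sample of the file (just a sketch; `corpus.txt` and the sample size are made up). The fewer distinct words relative to total tokens, the smaller the pair counts stay:

```python
from collections import Counter
from itertools import islice

SAMPLE_LINES = 1_000_000  # hypothetical sample size; reading all 115GB would itself be costly

# Count word occurrences over a sample of the corpus.
word_counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    for line in islice(f, SAMPLE_LINES):
        word_counts.update(line.split())

print(f"{sum(word_counts.values())} tokens, {len(word_counts)} distinct words in sample")
```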

manueltonneau commented 4 years ago

Exactly, it includes one normal sentence per line. You're right, I was running another process on the side, which didn't help. Rerunning it now and it seems to work. Thanks a lot :)

julien-c commented 4 years ago

Out of curiosity, what kind of RAM did you have on that machine?