Open ckoshka opened 3 years ago
Never mind, all you need to do is train the tokeniser with YTTM, take the vocab file it outputs, strip out the numbers, and use the result as the training file for train_tokeniser, which then finishes almost instantly. Maybe this could be added to the official notebook?
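For reference, a rough sketch of the "strip out the numbers" step. The vocab-file layout assumed here (one subword per line, followed by numeric metadata such as an ID or score) and the special-token format (`<PAD>`, `<UNK>`, etc.) are assumptions; check what your YTTM version actually emits before using this:

```python
import re

def strip_vocab_numbers(vocab_path: str, out_path: str) -> int:
    """Strip trailing numeric columns (IDs/scores) from a YTTM-style
    vocab file, keeping one bare subword per line.

    NOTE: the exact column layout is an assumption; inspect your file.
    Returns the number of subwords kept."""
    kept = 0
    with open(vocab_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Assume the subword is the first whitespace-separated field
            # and everything after it is numeric metadata to drop.
            fields = line.rstrip("\n").split()
            if not fields:
                continue
            token = fields[0]
            # Skip special tokens like <PAD>, <UNK>, <BOS>, <EOS>
            if re.fullmatch(r"<[A-Z]+>", token):
                continue
            dst.write(token + "\n")
            kept += 1
    return kept

# The cleaned file can then be used as the training corpus for
# aitextgen's tokenizer training, e.g. (hypothetical usage):
#   from aitextgen.tokenizers import train_tokenizer
#   train_tokenizer("vocab_clean.txt")
```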
Encoding will still take a while, though. YTTM outputs a text file containing the IDs, and I'm not sure how to convert that into the .tar.gz format that aitextgen expects.
Its README says it's around 90 times faster, and I was able to convert 1 GB of Thai text in under 30 seconds. It also doesn't appear to cause OOM errors or overuse RAM.
https://github.com/VKCOM/YouTokenToMe