Open ckoshka opened 3 years ago
Never mind, all you need to do is train the tokeniser with YTTM, take the vocab file it outputs, strip out the numbers, and use the result as the training file for train_tokeniser, which then finishes almost instantly. Maybe this could be added to the official notebook?
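For reference, a rough sketch of the "strip out the numbers" step. The vocab-file layout assumed here (one subword per line, followed by numeric metadata such as an ID or score) and the special-token format (`<PAD>`, `<UNK>`, etc.) are assumptions; check what your YTTM version actually emits before using this:

```python
import re

def strip_vocab_numbers(vocab_path: str, out_path: str) -> int:
    """Strip trailing numeric columns (IDs/scores) from a YTTM-style
    vocab file, keeping one bare subword per line.

    NOTE: the exact column layout is an assumption; inspect your file.
    Returns the number of subwords kept."""
    kept = 0
    with open(vocab_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Assume the subword is the first whitespace-separated field
            # and everything after it is numeric metadata to drop.
            fields = line.rstrip("\n").split()
            if not fields:
                continue
            token = fields[0]
            # Skip special tokens like <PAD>, <UNK>, <BOS>, <EOS>
            if re.fullmatch(r"<[A-Z]+>", token):
                continue
            dst.write(token + "\n")
            kept += 1
    return kept

# The cleaned file can then be used as the training corpus for
# aitextgen's tokenizer training, e.g. (hypothetical usage):
#   from aitextgen.tokenizers import train_tokenizer
#   train_tokenizer("vocab_clean.txt")
```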
Encoding will still take a while, though. YTTM outputs a text file containing the IDs, and I'm not sure how to convert that into the .tar.gz format that aitextgen expects.
Its README says it's around 90 times faster, and I was able to convert 1 GB of Thai text in under 30 seconds. It also doesn't appear to cause OOM errors or overuse RAM.
https://github.com/VKCOM/YouTokenToMe