minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License

Different tokenization methods #12

Open iedmrc opened 4 years ago

iedmrc commented 4 years ago

Which tokenization methods does aitextgen support (BPE, WordPiece, etc.)? If only one, how can we extend it to use others?

Thanks!

minimaxir commented 4 years ago

Custom tokenizers just use BPE, since that is what GPT-2 uses.

In theory you could use arbitrary tokenizers, but I haven't tested it. I'm not sure there's an ROI for doing so, or whether GPT-2 would behave well with them.
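For anyone curious what BPE actually does under the hood: start from character-level symbols and repeatedly merge the most frequent adjacent pair. Below is a minimal educational sketch of that merge loop — it is not aitextgen's implementation (aitextgen delegates to the Hugging Face `tokenizers` library), and the function names here are made up for illustration.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of the pair with its merged symbol.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(corpus_words, num_merges):
    # Start from space-separated characters, then greedily merge
    # the most frequent adjacent pair, num_merges times.
    vocab = Counter(" ".join(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
print(merges)  # first two merges are ('l', 'o') then ('lo', 'w')
```

GPT-2's actual tokenizer is a byte-level variant of this idea, so a custom vocabulary trained this way stays structurally compatible with the model's embedding layout.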