google-research / multilingual-t5


how to train the sentencepiece tokenizer #47

Open world2vec opened 3 years ago

world2vec commented 3 years ago

Hi, thanks for sharing your great work. Could you detail how to train the mT5 SentencePiece tokenizer? Thanks.

prestonfrasch commented 3 years ago

Hi world2vec,

I found the SentencePiece documentation helpful, and I generally use these shell commands to encode/decode a corpus (from lopuhin/transformer-lm).

Prepare data for training. Corpus format: a directory with top-level train, valid, and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
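For concreteness, here is one way a corpus laid out like that could be flattened into the single text file the trainer reads (the original text below notes that sp-train produces such a file, sp-text.txt, as a by-product). This is a hypothetical helper sketch, not part of transformer-lm; flatten_corpus and the data/corpora-en path are made-up names for illustration:

import sys
from pathlib import Path

def flatten_corpus(corpus_dir: str, out_path: str) -> None:
    # Walk train/valid/test and any sub-folders, keeping only .txt files.
    # SentencePiece treats each line of the output file as one sentence.
    with open(out_path, "w", encoding="utf-8") as out:
        for txt in sorted(Path(corpus_dir).rglob("*.txt")):
            for line in txt.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")

if __name__ == "__main__":
    # e.g. python flatten.py data/corpora-en sp-text.txt
    flatten_corpus(sys.argv[1], sys.argv[2])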

The commands to train the SentencePiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.

Train the SentencePiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the SentencePiece arguments as advised if needed (this is not supported in the sp-train command directly):

sp-train data/corpora-* sp-text.txt sp-model

Encode the corpora, producing numpy files:

sp-encode data/corpora-* sp-model.model data/encoded
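If you would rather drive SentencePiece directly instead of going through the sp-train/sp-encode wrappers, the library ships a Python API that covers the same two steps. Here is a minimal sketch, assuming the corpus has already been flattened into sp-text.txt with one sentence per line; the vocab_size and sampling settings are illustrative placeholders, not the exact configuration used for mT5 (whose released vocabulary has 250,000 tokens):

import sentencepiece as spm

# Train a unigram model (the SentencePiece default) on the flattened corpus.
# input_sentence_size and shuffle_input_sentence cap memory use by training
# on a random sample of lines, which addresses the memory issue noted above.
spm.SentencePieceTrainer.train(
    input="sp-text.txt",           # one sentence per line, UTF-8
    model_prefix="sp-model",       # writes sp-model.model and sp-model.vocab
    model_type="unigram",
    vocab_size=32000,              # illustrative; mT5 itself uses 250000
    character_coverage=1.0,        # default is 0.9995; raise it for
                                   # broad multilingual character coverage
    input_sentence_size=10000000,  # sample at most 10M sentences
    shuffle_input_sentence=True,
)

# Load the trained model and encode text to ids or subword pieces.
sp = spm.SentencePieceProcessor(model_file="sp-model.model")
print(sp.encode("how to train the tokenizer", out_type=int))
print(sp.encode("how to train the tokenizer", out_type=str))

The sampling flags are the knob I'd reach for first when training runs out of memory; for very large corpora SentencePiece also has a train_extremely_large_corpus option.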

Hope that's helpful!

Cheers, Preston