google-research / multilingual-t5


how to train the sentencepiece tokenizer #47

Open world2vec opened 3 years ago

world2vec commented 3 years ago

Hi, thanks for sharing your great work. Could you detail how to train the mT5 SentencePiece tokenizer? Thanks.

prestonfrasch commented 3 years ago

Hi world2vec,

I found the SentencePiece documentation helpful, and I generally use these shell commands to encode/decode a corpus (from lopuhin/transformer-lm).

Prepare data for training. Corpus format: a directory with top-level train, valid, and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
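For concreteness, here is one way a corpus laid out like that could be flattened into the single text file the trainer reads (the original text below notes that sp-train produces such a file, sp-text.txt, as a by-product). This is a hypothetical helper sketch, not part of transformer-lm; flatten_corpus and the data/corpora-en path are made-up names for illustration:

import sys
from pathlib import Path

def flatten_corpus(corpus_dir: str, out_path: str) -> None:
    # Walk train/valid/test and any sub-folders, keeping only .txt files.
    # SentencePiece treats each line of the output file as one sentence.
    with open(out_path, "w", encoding="utf-8") as out:
        for txt in sorted(Path(corpus_dir).rglob("*.txt")):
            for line in txt.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")

if __name__ == "__main__":
    # e.g. python flatten.py data/corpora-en sp-text.txt
    flatten_corpus(sys.argv[1], sys.argv[2])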

The commands to train the SentencePiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.

Train the SentencePiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the SentencePiece arguments as advised if needed (this is not supported in the sp-train command directly):

sp-train data/corpora-* sp-text.txt sp-model

Encode the corpora, producing numpy files:

sp-encode data/corpora-* sp-model.model data/encoded
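If you would rather drive SentencePiece directly instead of going through the sp-train/sp-encode wrappers, the library ships a Python API that covers the same two steps. Here is a minimal sketch, assuming the corpus has already been flattened into sp-text.txt with one sentence per line; the vocab_size and sampling settings are illustrative placeholders, not the exact configuration used for mT5 (whose released vocabulary has 250,000 tokens):

import sentencepiece as spm

# Train a unigram model (the SentencePiece default) on the flattened corpus.
# input_sentence_size and shuffle_input_sentence cap memory use by training
# on a random sample of lines, which addresses the memory issue noted above.
spm.SentencePieceTrainer.train(
    input="sp-text.txt",           # one sentence per line, UTF-8
    model_prefix="sp-model",       # writes sp-model.model and sp-model.vocab
    model_type="unigram",
    vocab_size=32000,              # illustrative; mT5 itself uses 250000
    character_coverage=1.0,        # default is 0.9995; raise it for
                                   # broad multilingual character coverage
    input_sentence_size=10000000,  # sample at most 10M sentences
    shuffle_input_sentence=True,
)

# Load the trained model and encode text to ids or subword pieces.
sp = spm.SentencePieceProcessor(model_file="sp-model.model")
print(sp.encode("how to train the tokenizer", out_type=int))
print(sp.encode("how to train the tokenizer", out_type=str))

The sampling flags are the knob I'd reach for first when training runs out of memory; for very large corpora SentencePiece also has a train_extremely_large_corpus option.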

Hope that's helpful!

Cheers, Preston