world2vec opened this issue 3 years ago (status: Open)
Hi world2vec,
I found the documentation on SentencePiece helpful, and I generally use this workflow to encode/decode a corpus (from lopuhin/transformer-lm).
Prepare data for training. Corpus format: a directory with top-level train, valid, and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
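As a concrete illustration, that layout can be sanity-checked with a few lines of Python (the function name `check_corpus_layout` is my own, not part of transformer-lm):

```python
from pathlib import Path

def check_corpus_layout(root):
    """Verify the corpus layout described above: a directory with
    top-level train/valid/test folders, each containing (possibly
    nested) .txt files. Returns a list of problems; empty means OK."""
    root = Path(root)
    problems = []
    for split in ("train", "valid", "test"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing folder: {split_dir}")
            continue
        # .txt files may live in sub-folders, so search recursively
        if not any(split_dir.rglob("*.txt")):
            problems.append(f"no .txt files under {split_dir}")
    return problems
```

Running this before training saves a failed run on a mislaid folder.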
The commands to train the SentencePiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.
Train the SentencePiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the SentencePiece arguments as advised if needed (this is not supported directly by the sp-train command):
sp-train data/corpora-* sp-text.txt sp-model
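Roughly, this step concatenates the corpora into one text file and then runs SentencePiece training on it. A stand-alone sketch of that idea (the function names are mine, and the memory-related training options shown are an assumption — tune them per the SentencePiece docs for your data size):

```python
from pathlib import Path

def gather_text(corpus_dirs, out_path):
    """Concatenate every .txt file under the given corpus
    directories into one file for SentencePiece training."""
    with open(out_path, "w", encoding="utf-8") as out:
        for d in corpus_dirs:
            for txt in sorted(Path(d).rglob("*.txt")):
                out.write(txt.read_text(encoding="utf-8"))
                out.write("\n")

def train_sp_model(text_path, model_prefix, vocab_size=50000):
    """Train a SentencePiece model on the gathered text.
    Requires `pip install sentencepiece` (imported lazily so the
    gather step above works without it)."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(
        input=str(text_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        # Assumed memory-saving knobs: subsample and shuffle the
        # input instead of loading every sentence into RAM.
        input_sentence_size=10_000_000,
        shuffle_input_sentence=True,
    )
```

After training you get model_prefix.model and model_prefix.vocab, and the intermediate text file can be deleted, mirroring the note about sp-text.txt above.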
Encode corpora, producing numpy files:
sp-encode data/corpora-* sp-model.model data/encoded
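The encoding step walks each corpus, tokenizes it with the trained model, and writes the token ids as numpy arrays. A minimal sketch (helper names are mine; the uint16 packing assumes the vocabulary fits in 16 bits, which I believe is why compact dtypes are used here):

```python
from pathlib import Path
import numpy as np

def pack_ids(ids, vocab_size):
    """Pack token ids into a compact numpy array: uint16 when the
    vocabulary fits in 16 bits, uint32 otherwise."""
    dtype = np.uint16 if vocab_size < 2**16 else np.uint32
    return np.array(ids, dtype=dtype)

def encode_split(split_dir, sp, out_path, vocab_size):
    """Encode all .txt files under one split (train/valid/test).
    `sp` is a loaded sentencepiece.SentencePieceProcessor
    (requires `pip install sentencepiece`)."""
    ids = []
    for txt in sorted(Path(split_dir).rglob("*.txt")):
        ids.extend(sp.encode(txt.read_text(encoding="utf-8")))
    np.save(out_path, pack_ids(ids, vocab_size))
```

You would call encode_split once per train/valid/test folder, writing one .npy file each into data/encoded.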
Hope that's helpful!
Cheers, Preston
Hi, thanks for sharing your good work. Could you detail how to train the mT5 SentencePiece tokenizer? Thanks.