Though not explicitly stated in the paper, I understand that mT5 uses a SentencePiece Unigram tokenizer (please correct me if I am wrong). I cannot seem to find how much data this tokenizer was trained on.
The mT5 paper says, "As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language sampling rates used during pre-training." The T5 paper says, "Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian." However, I do not see the raw GB and/or token counts for the tokenizer's training data anywhere.
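For context, here is roughly the kind of SentencePiece training call I have in mind. The corpus file, character coverage, and sampling parameters below are my own guesses rather than values from the papers (only the 250,000 vocabulary size comes from the mT5 paper), and my question is essentially what `input_sentence_size`, or the equivalent raw data volume, actually was:

```python
# Minimal sketch of a SentencePiece Unigram training invocation, assuming the
# language-sampled mC4 mixture has already been dumped to a text file.
# File name, character_coverage, and input_sentence_size are guesses, not
# values from the T5/mT5 papers.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="mixed_language_sample.txt",   # hypothetical dump of the sampled corpus
    model_prefix="mt5_unigram",
    model_type="unigram",                # Unigram LM, per my reading of the papers
    vocab_size=250000,                   # mT5's reported vocabulary size
    character_coverage=0.99995,          # guess; a common choice for multilingual data
    input_sentence_size=10_000_000,      # guess; SentencePiece subsamples this many sentences
    shuffle_input_sentence=True,
    train_extremely_large_corpus=True,   # needed when the input is very large
)
```

In particular, since SentencePiece subsamples the input when `input_sentence_size` is set, knowing that value (or the size of the pre-sampled file) would answer my question.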
How much data was the tokenizer trained on? (And, if you recall, approximately how long did it take to train, and how much RAM was required?)