google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

[T5 v1.1] PreTrained Tokenizer Files #521

Closed · patrickvonplaten closed this issue 4 years ago

patrickvonplaten commented 4 years ago

Hey @craffel ,

Sorry to annoy you again. Do the new T5 v1.1 models (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md) use the same pretrained tokenizers as the original T5 models?

craffel commented 4 years ago

Yes. (note that mT5, which is based on T5.1.1, of course uses a different vocab)
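
For reference, a minimal sketch of how to inspect the shared vocab directly with the sentencepiece library. I believe the default vocab path in this repo is gs://t5-data/vocabs/cc_all.32000/sentencepiece.model, so download it first, e.g. with `gsutil cp gs://t5-data/vocabs/cc_all.32000/sentencepiece.model .`:

```python
# Sketch: load the released T5 SentencePiece model and look at its pieces.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("sentencepiece.model")  # downloaded from the GCS path above

# 32000 pieces; the 100 sentinel/extra ids are added on top by the
# library's vocabulary wrapper, not stored in the .model file itself.
print(sp.GetPieceSize())
print(sp.EncodeAsPieces("Exploring the limits of transfer learning"))
print(sp.EncodeAsIds("Exploring the limits of transfer learning"))
```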

shenfe commented 4 years ago

Same question as @patrickvonplaten: I have tried many times to finetune T5.1.1 on some tasks, but got much lower performance than with T5.1.0.

I was using the same pretrained SentencePiece model to tokenize for both T5.1.0 and T5.1.1, so I wonder whether the two versions have different token->id mappings, which could explain the difference during training.
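
For instance, a quick check along these lines (assuming the Hugging Face exports "t5-base" and "google/t5-v1_1-base" mirror the released vocabularies) should show whether the mappings match:

```python
# Sketch: compare the token->id mappings of the T5 1.0 and T5 1.1 tokenizers.
from transformers import T5Tokenizer

tok_v10 = T5Tokenizer.from_pretrained("t5-base")
tok_v11 = T5Tokenizer.from_pretrained("google/t5-v1_1-base")

vocab_v10 = tok_v10.get_vocab()
vocab_v11 = tok_v11.get_vocab()

# If both versions really share one SentencePiece model, every token should
# map to the same id and this prints True.
same = all(vocab_v11.get(tok) == idx for tok, idx in vocab_v10.items())
print(len(vocab_v10), len(vocab_v11), same)
```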

Arij-Aladel commented 3 years ago

@craffel Sorry for this question, I am not an expert here, but it has been on my mind for a while. I understand that the difference between the pretrained T5 models is the number of layers and, consequently, the number of parameters. But what is then the difference between the pretrained tokenizers? All models are pretrained on C4, so if the tokenizer is also trained on the C4 corpus, why is it loaded under different names? Is the pretrained tokenizer the same for all models, and when loading it do we simply refer to the config of the pretrained model, which points to the same tokenizer file? I have actually tried the three tokenizers (small, base, big) on small samples of text and did not notice any difference; comparing the vocabularies of the three tokenizers, I found that they are identical.

Another question, please, and correct me if I am wrong. To my knowledge, the tokenizer and the data distribution go hand in hand when training any model. If I want to pre-train T5 models with different numbers of layers on masked language modeling on, say, some English text dataset from Hugging Face, do I need to train a tokenizer on that corpus, or is it enough to use the pretrained T5 tokenizer? (A rough sketch of the first option is below.)
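
This is a minimal sketch of what I mean by training a new tokenizer on my own corpus; the file names, vocab size and model type here are illustrative assumptions, not T5's exact recipe:

```python
# Sketch: train a fresh SentencePiece model on a custom corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="my_english_corpus.txt",  # one sentence/document per line (hypothetical file)
    model_prefix="my_spm",          # writes my_spm.model and my_spm.vocab
    vocab_size=32000,               # same order of magnitude as T5's released 32k vocab
    model_type="unigram",           # SentencePiece's default; pick what fits your data
)

# Reusing the pretrained T5 tokenizer is often reasonable for general English
# text, since it was trained on C4; a new vocab mainly helps for corpora whose
# domain or language differs a lot from C4.
sp = spm.SentencePieceProcessor()
sp.Load("my_spm.model")
print(sp.EncodeAsPieces("pre-training a small T5 from scratch"))
```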