trisongz opened this issue 4 years ago
Our gin file includes the following to reserve tokens for sentinels during pretraining:
vocabularies.Vocabulary.extra_ids = 100
Thanks for the response @adarob. I had seen that in the config but wasn't sure whether it was meant only for special token IDs or for any additional tokens. So just to clarify: given that the default SPM encoder has 35k tokens and I'm trying to extend it to 350k, would it be:
vocabularies.Vocabulary.extra_ids = 365000 ?
The SentencePieceVocabulary class will load the correct size from your spm model and the extra_ids should stay at 100 if you're using the same pre-training objective as in T5. This is probably not the issue.
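For example, a minimal sketch of how the size is derived, assuming the t5.data.SentencePieceVocabulary API; the model path below is just a placeholder:

```python
import t5

# Placeholder GCS path to the custom SentencePiece model.
vocab = t5.data.SentencePieceVocabulary(
    "gs://my-bucket/custom_350k.model", extra_ids=100)

# The base size is read from the .model file itself; extra_ids just appends the
# 100 sentinel tokens used by the span-corruption objective.
print(vocab.vocab_size)  # e.g. 350000 + 100
```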
Instead, it appears that the vocab size may be too large for the model to support. @nshazeer do you know where the [0, 250112] range is coming from?
I have no idea where 250112 is coming from. Some wild guesses:
I started off with the Colab that was provided as a demo and gradually experimented with different parameters. With this one in particular, I had originally tried to initialize training with:
```python
model = t5.models.MtfModel(params)
model.train(
    mixture_or_task_name="translation",
    init_checkpoint=None,
    steps=TRAIN_STEPS,
)
```
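For context, a rough self-contained sketch of what that setup might look like, loosely based on the public T5 demo Colab; every path and value below is a placeholder rather than the configuration actually used here:

```python
import t5

# Placeholder values loosely following the T5 demo Colab, not the exact
# configuration described above.
MODEL_DIR = "gs://my-bucket/models/custom_vocab_small"  # hypothetical path
TPU_ADDRESS = None        # set to the TPU grpc address when running on Cloud TPU
TRAIN_STEPS = 100000

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology="v3-8",
    model_parallelism=1,
    batch_size=16,
    sequence_length={"inputs": 512, "targets": 512},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=5,
    iterations_per_loop=100,
)

model.train(
    mixture_or_task_name="translation",
    init_checkpoint=None,  # pre-training from scratch, no checkpoint to restore
    steps=TRAIN_STEPS,
)
```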
I also tried using mesh_transformer (following this guide), but that guide seems to load a TFDS dataset for pre-training, whereas I'm using a TSV-based t5.Task dataset adapter.
To do a pure LM from scratch with the expanded vocab (custom SPE), would it be better to do it with a gin file using mesh_transformer, or is there a way to do it through the Model API given a TSV dataset (targets only)?
Hi @trisongz @adarob, is there any way to increase the number of tokens with a custom SentencePiece model and use it for pre-training? I'm also looking for the gin-file configuration needed to accommodate such a change. I have trained a SentencePiece model with 100k tokens and want to use it for T5 pre-training.
@ashispapu, you can provide the path to your sentencepiece model in the output_features arg of the Task definition.
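For example, a minimal sketch of a Task definition along those lines; the exact keyword arguments vary between t5 versions, and the model path, TSV paths, and dataset function here are placeholders:

```python
import tensorflow as tf
import t5

# Placeholder paths for the custom SentencePiece model and the TSV data.
CUSTOM_SPM_PATH = "gs://my-bucket/custom_350k.model"
TSV_PATHS = {"train": "train.tsv", "validation": "validation.tsv"}

vocab = t5.data.SentencePieceVocabulary(CUSTOM_SPM_PATH, extra_ids=100)

def tsv_dataset_fn(split, shuffle_files=False):
  # Each TSV line is "<source>\t<target>"; map it to the inputs/targets dict
  # that a t5 Task expects.
  ds = tf.data.TextLineDataset(TSV_PATHS[split])
  return ds.map(
      lambda line: dict(zip(
          ["inputs", "targets"],
          tf.io.decode_csv(line, record_defaults=["", ""], field_delim="\t"))),
      num_parallel_calls=tf.data.experimental.AUTOTUNE)

t5.data.TaskRegistry.add(
    "translation",
    dataset_fn=tsv_dataset_fn,
    splits=["train", "validation"],
    text_preprocessor=None,  # no extra text preprocessing in this sketch
    # The custom vocabulary is attached here, via output_features.
    output_features={
        "inputs": t5.data.Feature(vocabulary=vocab),
        "targets": t5.data.Feature(vocabulary=vocab),
    },
    metric_fns=[],
)
```

(Older t5 releases attach the vocabulary via a sentencepiece_model_path argument on the Task instead of output_features.)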
Thanks. Below is my understanding:
It takes care of replacing the default embedding layer (32256 × 768) with one sized to the number of tokens in the custom SentencePiece model, i.e. the new embedding layer is (100k × 768).
Hi - I'm working on a translation-based T5 model for some less common languages that are currently out of vocabulary for the existing SPE model. I trained a new SPE model with a 350k vocab_size, taking into account the note in the code that pad_id=0, eos_id=1, unk_id=2, bos_id=-1 are required parameters.
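For reference, a minimal sketch of training such a model with those ID assignments through the sentencepiece Python bindings; the corpus path and model prefix are placeholders:

```python
import sentencepiece as spm

# Placeholder corpus and output prefix; the ID flags match the assignments
# noted above (pad=0, eos=1, unk=2, no bos).
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "
    "--model_prefix=custom_350k "
    "--vocab_size=350000 "
    "--pad_id=0 --eos_id=1 --unk_id=2 --bos_id=-1"
)
```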
I was able to get all steps working with the Colab notebook on a custom dataset before changing the SPE model, but after swapping the existing vocab model out for the new custom one, the model won't train.
During data validation, everything looks good.
However, during pre-training (since fine-tuning would require matching the original vocab size), it returned a dimension error for the vocab.
What would I need to adjust if I'm running MtfModel for pre-training, or is this only possible with the mesh_transformer approach? And if so, where would I define the vocab size for the model in the gin file?