google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Using new Custom Sentencepiece Encoder for Custom Languages #229

Open trisongz opened 4 years ago

trisongz commented 4 years ago

Hi - I'm working on a translation-based T5 model for less common languages that are currently out of vocabulary for the existing SentencePiece model. I trained a new SentencePiece model with a 350k vocab_size, taking into account the note in the code that pad_id=0, eos_id=1, unk_id=2, bos_id=-1 are required parameters.
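For reference, here is a minimal sketch of training such a SentencePiece model with those special-token IDs, using the keyword-argument form of the Python trainer API (the corpus path, model prefix, and vocab size are illustrative placeholders, not the exact values used above):

import sentencepiece as spm

# Train a SentencePiece model with the special-token IDs the T5 code expects:
# pad=0, eos=1, unk=2, and no bos token. Paths and vocab_size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # hypothetical training corpus
    model_prefix="custom_spm",    # writes custom_spm.model / custom_spm.vocab
    vocab_size=350_000,
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1,                    # disable the bos token
    model_type="unigram",
)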

I was able to get all steps working with the Colab notebook on a custom dataset prior to changing the SentencePiece model, but after swapping the existing vocab model out for the new custom one, the model won't train.

During data validation, everything looks good.

[screenshot: data validation output]

However, during pre-training (since fine-tuning requires matching the checkpoint's vocab size), it returned a dimension error for the vocab.

[screenshot: vocab dimension error during pre-training]

What would I need to adjust if I'm running the MtfModel for pre-training, or is it only possible with the mesh_transformer model? And if so, where would I define the vocab size for the model in the gin file?

adarob commented 4 years ago

Our gin file includes the following to reserve tokens for sentinels during pretraining:

vocabularies.Vocabulary.extra_ids = 100
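
As a rough illustration (a sketch assuming the t5 Python API; the model path is hypothetical), the vocabulary the model actually sees is the SentencePiece model's size plus these reserved sentinel IDs:

import t5

# Hypothetical path to a custom SentencePiece model.
vocab = t5.data.SentencePieceVocabulary(
    "gs://my-bucket/custom_spm.model", extra_ids=100)

# vocab_size = number of pieces in the .model file + 100 sentinel tokens
print(vocab.vocab_size)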

trisongz commented 4 years ago

Thanks for the response @adarob. I had seen that in the config but wasn't sure whether it was meant only for special token IDs or for any tokens. So just to clarify, given that the default SPM encoder has 35k tokens and I am trying to extend it to 350k, it would be:

vocabularies.Vocabulary.extra_ids = 365000 ?

adarob commented 4 years ago

The SentencePieceVocabulary class will load the correct size from your spm model and the extra_ids should stay at 100 if you're using the same pre-training objective as in T5. This is probably not the issue.

Instead, it appears that the vocab size may be too large for the model to support. @nshazeer do you know where the [0, 250112] range is coming from?

nshazeer commented 4 years ago

I have no idea where 250112 is coming from. Some wild guesses:

trisongz commented 4 years ago

I started off using the Colab that was provided as a demo and gradually experimented with different parameters. For this one in particular, I had originally tried to initialize training with:

# `params` here stands in for the usual MtfModel constructor arguments
# (model_dir, tpu, model_parallelism, batch_size, sequence_length, etc.).
model = t5.models.MtfModel(**params)
model.train(
    mixture_or_task_name="translation",  # the custom Task registered for the TSV data
    init_checkpoint=None,                # pre-train from scratch instead of fine-tuning
    steps=TRAIN_STEPS
)

I also tried using mesh_transformer (following this guide), but that guide seems to load a TFDS dataset for pre-training, whereas I'm using a TSV-based t5 Task dataset adapter.

To train a pure LM from scratch with the expanded vocab (custom SentencePiece model), would it be better to do it with a gin file using mesh_transformer, or is there a way to do it with the Model API given a TSV dataset (targets only)?

ashispapu commented 4 years ago

Hi @trisongz @adarob, is there any way to increase the number of tokens for a custom SentencePiece model and use it for pre-training? I am also looking for the gin file configuration to accommodate such changes. I have trained a SentencePiece model with 100k tokens and want to use it for T5 pre-training.

adarob commented 4 years ago

@ashispapu, you can provide the path to your sentencepiece model in the output_features arg of the Task definition.
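
Something along these lines (a rough sketch, assuming a t5 version whose Task accepts an output_features dict of t5.data.Feature objects; the task name, dataset function, and paths are all hypothetical):

import t5
import tensorflow.compat.v1 as tf

# Hypothetical path to the custom 100k-token SentencePiece model.
CUSTOM_SPM_PATH = "gs://my-bucket/custom_100k.model"

def get_custom_vocabulary():
    # extra_ids=100 keeps the sentinel tokens used by the span-corruption objective.
    return t5.data.SentencePieceVocabulary(CUSTOM_SPM_PATH, extra_ids=100)

def my_dataset_fn(split, shuffle_files=False):
    # Placeholder: a two-column (input \t target) TSV; the path is hypothetical.
    ds = tf.data.TextLineDataset("gs://my-bucket/train.tsv")
    ds = ds.map(lambda line: tf.io.decode_csv(
        line, record_defaults=["", ""], field_delim="\t", use_quote_delim=False))
    return ds.map(lambda *ex: dict(zip(["inputs", "targets"], ex)))

t5.data.TaskRegistry.add(
    "my_custom_vocab_task",          # hypothetical task name
    dataset_fn=my_dataset_fn,
    splits=["train"],
    text_preprocessor=[],
    metric_fns=[],
    output_features={
        "inputs": t5.data.Feature(vocabulary=get_custom_vocabulary()),
        "targets": t5.data.Feature(vocabulary=get_custom_vocabulary()),
    },
)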

ashispapu commented 4 years ago

Thanks. Below is my understanding:

It replaces the default embedding layer (32256 × 768) with one sized to the number of tokens in the custom SentencePiece model, i.e. the new embedding layer is (100k × 768).
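
One detail worth noting (an assumption about how the Mesh TensorFlow code sizes the vocab dimension, not something confirmed in this thread): the embedding's vocab dimension is typically the SentencePiece size plus the sentinel extra_ids, rounded up to a multiple of 128 for TPU-friendly shapes, so checkpoint shapes don't always equal the raw vocab size exactly. A small sketch of the arithmetic:

# Illustrative numbers; the round-up-to-128 behavior is an assumption, not confirmed here.
spm_size = 100_000           # tokens in the custom SentencePiece model
extra_ids = 100              # sentinel tokens for the span-corruption objective
vocab_size = spm_size + extra_ids

# Pad the vocab dimension up to a multiple of 128 (a common TPU-friendly choice).
padded_vocab = -(-vocab_size // 128) * 128

print(vocab_size, padded_vocab)   # 100100 100224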