rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Add tokens after pretraining #97

Closed. Bachstelze closed this issue 3 years ago.

Bachstelze commented 3 years ago

Is it possible to add more tokens to the vocabulary and model after pretraining, e.g. for mBART or mRASP?

rsennrich commented 3 years ago

Hello Bachstelze,

This is more a question about the toolkit with which the respective models are trained. It is in principle possible to increase the size of the embedding matrix after pre-training and randomly initialize the newly added parameters, but I don't think most toolkits support this. A common workaround is to re-assign embeddings that you no longer need (for example, when moving to a new language) according to some strategy; Alham Fikri Aji has looked at some strategies (section 4.2): https://www.aclweb.org/anthology/2020.acl-main.688.pdf
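
To make this concrete, here is a minimal sketch of both approaches. It assumes the pretrained model is loaded through Hugging Face Transformers (this is outside the scope of subword-nmt); the checkpoint name, the added tokens, and the "unused" token are placeholders for illustration only.

```python
# Sketch only: assumes a Hugging Face Transformers checkpoint;
# the checkpoint and token names below are placeholders.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Option 1: grow the embedding matrix; the new rows are randomly initialized.
tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])
model.resize_token_embeddings(len(tokenizer))

# Option 2: re-use the row of a token you no longer need, e.g. initialize it
# with the mean of the existing embeddings (just one possible strategy).
with torch.no_grad():
    emb = model.get_input_embeddings().weight                 # (vocab_size, d_model)
    unused_id = tokenizer.convert_tokens_to_ids("<unused>")   # placeholder token
    emb[unused_id] = emb.mean(dim=0)
```

In either case the model will typically need some fine-tuning afterwards so that the new or re-assigned embeddings learn sensible values.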

best wishes, Rico