marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

separate option for loading a sentence piece model in marian-decoder #930

Open jorgtied opened 2 years ago

jorgtied commented 2 years ago

Feature description

It would be nice to have a separate option for loading a SentencePiece model in marian-decoder, so that the integrated subword segmentation can be used without requiring the SentencePiece model to also serve as the vocabulary.

The reason is that I made the unconventional choice of training independent SentencePiece models for the source and target languages while still using a joint vocabulary when training the NMT models. In my experience, independent SentencePiece models often produce better subword segmentations, but I still want tied embeddings in the NMT model, as they seem to work better for me. Simply concatenating the two SPM vocabularies is not enough, because independently trained subword models may share pieces, so the concatenated vocabulary would contain duplicate entries.
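To illustrate the overlap problem, here is a minimal Python sketch of building a joint vocabulary from two independently trained piece lists. The function name `merge_vocabs` and the toy piece lists are made up for illustration, and placing `</s>` and `<unk>` first is an assumption about the desired ID layout, not something taken from Marian's code.

```python
from typing import Iterable, List, Sequence


def merge_vocabs(
    source_pieces: Iterable[str],
    target_pieces: Iterable[str],
    specials: Sequence[str] = ("</s>", "<unk>"),  # assumed ordering of special symbols
) -> List[str]:
    """Merge two piece lists into one joint list without duplicates.

    Order is preserved: specials first, then source pieces, then any
    target pieces that have not been seen yet.
    """
    merged: List[str] = list(specials)
    seen = set(merged)
    for piece in list(source_pieces) + list(target_pieces):
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged


# Toy piece lists standing in for two independently trained SPM vocabularies.
source = ["<unk>", "<s>", "</s>", "▁the", "▁a", "ing"]
target = ["<unk>", "<s>", "</s>", "▁der", "▁a", "ing"]

naive = source + target              # plain concatenation keeps duplicates
joint = merge_vocabs(source, target)  # deduplicated joint vocabulary

print(len(naive), len(set(naive)), len(joint))
```

Plain concatenation lists the shared pieces twice, whereas the deduplicated list gives every piece a single position, which is what a joint vocabulary with tied embeddings needs.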

Example

marian-decoder --source-spm source-language.spm --vocabs joint-vocab.yml joint-vocab.yml ...

This would use source-language.spm to segment the input but still use the vocabulary from joint-vocab.yml for the model.
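The intended decoupling can be sketched in Python as follows. This is only an illustration of the idea, not Marian's implementation: `encode_with_separate_segmenter` and `toy_segment` are hypothetical names, and the whitespace-based segmenter is a stand-in for a real SentencePiece model.

```python
from typing import Callable, Dict, List


def encode_with_separate_segmenter(
    text: str,
    segment: Callable[[str], List[str]],
    joint_vocab: Dict[str, int],
    unk: str = "<unk>",
) -> List[int]:
    """Segment with one model, then look up IDs in a separate joint vocabulary."""
    pieces = segment(text)            # segmentation comes from the source-side model
    unk_id = joint_vocab[unk]         # pieces missing from the joint vocab map to <unk>
    return [joint_vocab.get(piece, unk_id) for piece in pieces]


def toy_segment(text: str) -> List[str]:
    """Stand-in segmenter: one piece per whitespace-separated word, SPM-style prefix."""
    return ["▁" + word for word in text.split()]


# Toy joint vocabulary; a real one would be loaded from joint-vocab.yml.
joint_vocab = {"</s>": 0, "<unk>": 1, "▁hello": 2, "▁world": 3}

ids_known = encode_with_separate_segmenter("hello world", toy_segment, joint_vocab)
ids_unknown = encode_with_separate_segmenter("hello there", toy_segment, joint_vocab)
print(ids_known, ids_unknown)
```

The point is that the segmenter and the ID lookup are two independent inputs, so the segmentation model no longer has to double as the vocabulary.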