Feature description
It would be nice to have a separate option for loading a SentencePiece model for marian-decoder's integrated subword segmentation, instead of requiring the segmentation model to be identical to the vocabulary.
The reason is that I made the unconventional choice of training independent source- and target-language SentencePiece models for subword segmentation while still using a joint vocabulary when training models. In my experience, independent SentencePiece models often produce better subword segmentations, but I still want to use tied embeddings in the NMT model, which also seems to work better for me. Simply concatenating the SPM vocabularies is not enough, because vocabularies from independently trained subword models may overlap.
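The overlap problem with naive concatenation can be sketched as follows; the vocabulary contents are invented purely for illustration, but the structure mirrors the issue: independently trained models share common pieces, so a simple concatenation assigns the same piece two ids, and a joint vocabulary needs a deduplicating merge instead.

```python
# Two small vocabularies from hypothetical, independently trained
# SentencePiece models (contents invented for illustration).
src_vocab = ["<unk>", "<s>", "</s>", "▁the", "▁house", "▁Haus"]
trg_vocab = ["<unk>", "<s>", "</s>", "▁das", "▁Haus", "▁the"]

# Naive concatenation duplicates shared pieces such as "▁the" and "▁Haus",
# so the same subword would map to two different embedding rows.
naive = src_vocab + trg_vocab
assert len(naive) != len(set(naive))

# A deduplicating merge keeps exactly one id per piece, which is what a
# joint vocabulary with tied embeddings requires.
joint = {}
for piece in src_vocab + trg_vocab:
    if piece not in joint:
        joint[piece] = len(joint)

print(len(naive), len(joint))  # 12 entries naively vs. 7 unique pieces
```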
Example
This would use `source-language.spm` to segment the input but still use the vocabulary from `joint-vocab.yml` for the model.
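One way the requested option could look in a decoder config is sketched below. The `sentencepiece-models` key is hypothetical and does not exist in Marian; it is only an illustration of decoupling segmentation models from the vocabulary, and the `.spm` filenames are placeholders.

```yaml
# Hypothetical decoder config sketch; "sentencepiece-models" is a
# proposed option, not an existing Marian setting.
vocabs:
  - joint-vocab.yml          # shared vocabulary, enables tied embeddings
  - joint-vocab.yml
sentencepiece-models:        # hypothetical: segmentation decoupled from vocab
  - source-language.spm      # segments the source input
  - target-language.spm      # handles the target side
```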