Training Slovenian (and possibly other) customized lemmatize models produces incorrect predictions with <UNK> signs

To Reproduce

import trankit

# Replace each <PATH> placeholder with a real path before running.
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized',
        'task': 'lemmatize',
        'save_dir': <PATH>,
        'train_conllu_fpath': <PATH>,
        'dev_conllu_fpath': <PATH>
    }
)
trainer.train()

Expected behavior

The trained model should produce lemmas with accuracy on par with the default Slovenian model, with no <UNK> symbols in its predictions.
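One way to quantify the symptom is to count how many predicted lemmas contain `<UNK>`. The following is a minimal sketch, assuming the model's predictions have been written out in CoNLL-U format. The helper name `count_unk_lemmas` and the sample sentence are invented for illustration and are not taken from Trankit or from real model output.

```python
def count_unk_lemmas(conllu_text, unk_token="<UNK>"):
    """Return (lemmas containing unk_token, total word lines) for CoNLL-U text."""
    unk, total = 0, 0
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U word line
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        total += 1
        if unk_token in cols[2]:  # column 3 (index 2) is LEMMA
            unk += 1
    return unk, total


# Invented sample: the first lemma is corrupted the way the report describes.
sample = (
    "# text = Psi lajajo.\n"
    "1\tPsi\tp<UNK>s\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlajajo\tlajati\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

unk_count, word_count = count_unk_lemmas(sample)
print(f"{unk_count} of {word_count} lemmas contain <UNK>")
```

Running the same check on output from the default Slovenian model and from the customized model would give a direct comparison of how often `<UNK>` appears.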

Environment:

OS: Ubuntu 18.04.5 LTS
Python version: Python 3.9.16
Trankit version: 1.1.1

Temporary solution

The following monkey patch works around the problem for now:

import trankit
from trankit.utils.mwt_lemma_utils import seq2seq_vocabs
from trankit.utils.mwt_lemma_utils.seq2seq_utils import VOCAB_PREFIX, SOS, EOS

# Make seq2seq_vocabs use the special tokens defined in seq2seq_utils.
seq2seq_vocabs.EMPTY = SOS
seq2seq_vocabs.ROOT = EOS
seq2seq_vocabs.VOCAB_PREFIX = VOCAB_PREFIX

Note: The provided temporary solution seems to address the issue, but a more permanent fix may be required in the Trankit library to avoid the need for this workaround.
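The report does not state the root cause. One plausible reading of the workaround is that the vocabulary is built with a different set of special tokens than the seq2seq code expects. The toy sketch below is not Trankit's code; the token strings, `build_vocab`, and `encode` are all assumptions made for illustration. It only shows how such a mismatch would make start and end markers fall back to `<UNK>`.

```python
# Toy special tokens; the exact strings are assumed, not read from Trankit.
PAD, UNK, SOS, EOS = "<PAD>", "<UNK>", "<SOS>", "<EOS>"
EMPTY, ROOT = "<EMPTY>", "<ROOT>"


def build_vocab(prefix, chars):
    """Map special tokens first, then the sorted characters, to integer ids."""
    units = list(prefix) + sorted(set(chars))
    return {unit: idx for idx, unit in enumerate(units)}


def encode(vocab, symbols):
    """Look up each symbol, falling back to the UNK id when it is missing."""
    return [vocab.get(sym, vocab[UNK]) for sym in symbols]


# Vocabulary built with one prefix, sequence decorated with another.
mismatched = build_vocab([PAD, UNK, EMPTY, ROOT], "pes")
matched = build_vocab([PAD, UNK, SOS, EOS], "pes")

sequence = [SOS, "p", "e", "s", EOS]
print(encode(mismatched, sequence))  # SOS and EOS both map to the UNK id
print(encode(matched, sequence))     # SOS and EOS get their own ids
```

If something like this is what happens inside the library, aligning the two sets of special tokens, as the workaround does, would explain why the `<UNK>` symbols disappear. That remains a hypothesis until confirmed by the maintainers.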

nlp-uoregon/trankit, issue #75