nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Training Slovenian (and possibly other) customized lemmatize models produces incorrect predictions with <UNK> signs #75

Open lkrsnik opened 1 year ago

To Reproduce

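import trankit

# Replace each '<PATH>' placeholder with a real path before running.
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized',  # this comma was missing in the original snippet
        'task': 'lemmatize',
        'save_dir': '<PATH>',
        'train_conllu_fpath': '<PATH>',
        'dev_conllu_fpath': '<PATH>'
    }
)
trainer.train()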

Expected behavior

The trained model should produce lemmas with accuracy on par with the default Slovenian model.
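To make the comparison concrete, here is a minimal sketch of how lemma accuracy and the number of `<UNK>`-containing lemmas could be measured from two CoNLL-U files (gold and predicted). It is not part of Trankit: `read_lemmas` and `lemma_report` are hypothetical helper names, and the sketch assumes the unknown symbol appears literally as `<UNK>` in the predicted LEMMA column and that both files contain the same tokens in the same order.

```python
def read_lemmas(conllu_text):
    """Return the LEMMA column (3rd field) of every regular token line."""
    lemmas = []
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.split('\t')
        token_id = fields[0]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if '-' in token_id or '.' in token_id:
            continue
        lemmas.append(fields[2])
    return lemmas


def lemma_report(gold_text, pred_text, unk='<UNK>'):
    """Compare predicted lemmas with gold lemmas, token by token."""
    gold = read_lemmas(gold_text)
    pred = read_lemmas(pred_text)
    if len(gold) != len(pred):
        raise ValueError('gold and predicted files have different token counts')
    correct = sum(g == p for g, p in zip(gold, pred))
    with_unk = sum(unk in p for p in pred)
    return {
        'tokens': len(gold),
        'accuracy': correct / len(gold) if gold else 0.0,
        'with_unk': with_unk,
    }
```

Calling `lemma_report(open(gold_path, encoding='utf-8').read(), open(pred_path, encoding='utf-8').read())` on the dev set, once for the default Slovenian model and once for the customized one, would give comparable numbers; `gold_path` and `pred_path` are placeholders here.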

Environment:

Temporary solution

As a temporary fix, the following patch overrides the special-symbol constants in Trankit's seq2seq vocabulary module:

import trankit
from trankit.utils.mwt_lemma_utils.seq2seq_utils import VOCAB_PREFIX, SOS, EOS

# Make the seq2seq vocabulary module use the special symbols defined in seq2seq_utils.
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.EMPTY = SOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.ROOT = EOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.VOCAB_PREFIX = VOCAB_PREFIX

Note: This workaround seems to resolve the issue, but a permanent fix in the Trankit library itself is probably needed so that the patch is no longer necessary.