nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Issue with lemmatization of indonesian #45

Closed: xavier-taylor closed this issue 10 months ago

xavier-taylor commented 2 years ago

Input:

` from trankit import Pipeline

p = Pipeline('indonesian', embedding='xlm-roberta-large')

print(p('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.')) `

Output:

Loading pretrained XLM-Roberta, this may take a while...
Loading tokenizer for indonesian
Loading tagger for indonesian
Loading lemmatizer for indonesian

Active language: indonesian

{'text': 'Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.', 'sentences': [{'id': 1, 'text': 'Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.', 'tokens': [{'id': 1, 'text': 'Ia', 'upos': 'PRON', 'xpos': 'PS3', 'feats': 'Number=Sing|Person=3|PronType=Prs', 'head': 2, 'deprel': 'nsubj', 'dspan': (0, 2), 'span': (0, 2), 'lemma': 'ia'}, {'id': 2, 'text': 'menjadi', 'upos': 'VERB', 'xpos': 'VSA', 'feats': 'Number=Sing|Voice=Act', 'head': 0, 'deprel': 'root', 'dspan': (3, 10), 'span': (3, 10), 'lemma': 'menjadi'}, {'id': 3, 'text': 'Gubernur', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 2, 'deprel': 'obj', 'dspan': (11, 19), 'span': (11, 19), 'lemma': 'gubernur'}, {'id': 4, 'text': 'Bali', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 3, 'deprel': 'flat', 'dspan': (20, 24), 'span': (20, 24), 'lemma': 'bali'}, {'id': 5, 'text': 'menggantikan', 'upos': 'VERB', 'xpos': 'VSA', 'feats': 'Number=Sing|Voice=Act', 'head': 2, 'deprel': 'xcomp', 'dspan': (25, 37), 'span': (25, 37), 'lemma': 'mengantikan'}, {'id': 6, 'text': 'Anak', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 5, 'deprel': 'obj', 'dspan': (38, 42), 'span': (38, 42), 'lemma': 'anak'}, {'id': 7, 'text': 'Agung', 'upos': 'PROPN', 'xpos': 'ASP', 'feats': 'Degree=Pos|Number=Sing', 'head': 6, 'deprel': 'flat', 'dspan': (43, 48), 'span': (43, 48), 'lemma': 'agung'}, {'id': 8, 'text': 'Bagus', 'upos': 'PROPN', 'xpos': 'ASP', 'feats': 'Degree=Pos|Number=Sing', 'head': 7, 'deprel': 'flat', 'dspan': (49, 54), 'span': (49, 54), 'lemma': 'bagus'}, {'id': 9, 'text': 'Sutedja', 'upos': 'PROPN', 'xpos': 'X--', 'head': 8, 'deprel': 'flat', 'dspan': (55, 62), 'span': (55, 62), 'lemma': 'sutedja'}, {'id': 10, 'text': '.', 'upos': 'PUNCT', 'xpos': 'Z--', 'head': 2, 'deprel': 'punct', 'dspan': (62, 63), 'span': (62, 63), 'lemma': '.'}], 'dspan': (0, 63)}], 'lang': 'indonesian'}
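The lemmas are hard to spot in the raw dictionary, so here is a minimal sketch that pulls out just the (text, lemma) pairs. It assumes only the nested 'sentences' and 'tokens' layout visible in the output above; the helper name `lemma_pairs` and the trimmed `sample` dictionary are hypothetical and used for illustration only.

```python
def lemma_pairs(doc):
    """Return (text, lemma) for every token in a document-level output dict."""
    return [
        (tok["text"], tok.get("lemma"))
        for sent in doc.get("sentences", [])
        for tok in sent.get("tokens", [])
    ]


# Trimmed copy of the output above (two tokens only, other fields omitted).
sample = {
    "sentences": [
        {
            "tokens": [
                {"id": 2, "text": "menjadi", "lemma": "menjadi"},
                {"id": 5, "text": "menggantikan", "lemma": "mengantikan"},
            ]
        }
    ]
}

for text, lemma in lemma_pairs(sample):
    print(f"{text}\t{lemma}")
```

Passing the full dictionary returned by the pipeline call in place of `sample` should work the same way, as long as every token carries a 'lemma' key as it does here.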

Expected output:

Words like 'menggantikan' should be reduced to their lemma ('ganti'). Instead, the lemmatizer returns 'mengantikan'.

For example, here is the lemmatization from aksara (columns: id, form, lemma, UPOS, features):

1 Ia ia PRON Number=Sing|Person=3|PronType=Prs
2 menjadi jadi VERB Voice=Act
3 Gubernur Gubernur PROPN
4 Bali Bali PROPN
5 menggantikan ganti VERB Voice=Act
6 Anak Anak PROPN
7 Agung Agung PROPN
8 Bagus Bagus PROPN
9 Sutedja Sutedja PROPN SpaceAfter=No
10 . . PUNCT

Note how menjadi becomes jadi and menggantikan becomes ganti.
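To list the differences systematically, a rough sketch like this compares the two sets of lemmas token by token. The pairs are copied by hand from the trankit output and the aksara analysis above. The function name `lemma_mismatches` is hypothetical, and the comparison is case-insensitive on the assumption that capitalization differences (for example 'gubernur' vs 'Gubernur') are not the problem being reported.

```python
def lemma_mismatches(predicted, reference):
    """Return (form, predicted_lemma, reference_lemma) where the lemmas differ.

    Both inputs are lists of (form, lemma) pairs over the same tokens.
    Lemmas are compared case-insensitively.
    """
    mismatches = []
    for (form, pred), (ref_form, ref) in zip(predicted, reference):
        if form != ref_form:
            raise ValueError(f"token mismatch: {form!r} vs {ref_form!r}")
        if pred.lower() != ref.lower():
            mismatches.append((form, pred, ref))
    return mismatches


# (form, lemma) pairs copied from the trankit output above.
trankit_pairs = [
    ("Ia", "ia"),
    ("menjadi", "menjadi"),
    ("Gubernur", "gubernur"),
    ("menggantikan", "mengantikan"),
]

# (form, lemma) pairs copied from the aksara analysis above.
aksara_pairs = [
    ("Ia", "ia"),
    ("menjadi", "jadi"),
    ("Gubernur", "Gubernur"),
    ("menggantikan", "ganti"),
]

for form, pred, ref in lemma_mismatches(trankit_pairs, aksara_pairs):
    print(f"{form}: trankit={pred} aksara={ref}")
```

On these four tokens only the two verbs are reported, which matches the two cases noted above.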

I think this issue is related to the UD GSD data.

I have seen a similar issue in stanza: https://github.com/stanfordnlp/stanza/issues/1003

Thanks for releasing this great tool.

(Ubuntu, Python 3.9, trankit 1.1.0)

minhhdvn commented 10 months ago

Hi @xavier-taylor, thanks for letting us know. Our lemmatizer is a neural model trained on a text corpus, so its lemmatization results can vary depending on context. This is largely unavoidable unless we have another training corpus with higher-quality annotations.