Open mrgransky opened 1 year ago
Given the following code snippet:
import json
from trankit import Pipeline

p = Pipeline('auto', embedding='xlm-roberta-large')
doc = '''Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.'''
tokens = p(doc, is_sent=True)
print(json.dumps(tokens, indent=2, ensure_ascii=False))
For some reason, I get # in my lemma as seen in this sample doc:
{
  "text": "Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.",
  "tokens": [
    { "id": 1, "text": "Naton", "upos": "PROPN", "xpos": "N", "feats": "Case=Gen|Number=Sing", "head": 2, "deprel": "nmod:poss", "span": [0, 5], "lemma": "Nato" },
    { "id": 2, "text": "päämajassa", "upos": "NOUN", "xpos": "N", "feats": "Case=Ine|Number=Sing", "head": 4, "deprel": "obl", "span": [6, 16], "lemma": "pää#maja" },   <<< HERE
    { "id": 3, "text": "Brysselissä", "upos": "PROPN", "xpos": "N", "feats": "Case=Ine|Number=Sing", "head": 2, "deprel": "appos", "span": [17, 28], "lemma": "Bryssel" },
    { "id": 4, "text": "järjestettiin", "upos": "VERB", "xpos": "V", "feats": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass", "head": 0, "deprel": "root", "span": [29, 42], "lemma": "järjestää" },
    { "id": 5, "text": "iltapäivällä", "upos": "NOUN", "xpos": "N", "feats": "Case=Ade|Number=Sing", "head": 4, "deprel": "obl", "span": [43, 55], "lemma": "ilta#päivä" },   <<< HERE
    { "id": 6, "text": "Suomen", "upos": "PROPN", "xpos": "N", "feats": "Case=Gen|Number=Sing", "head": 8, "deprel": "nmod:poss", "span": [56, 62], "lemma": "Suomi" },
    { "id": 7, "text": "virallinen", "upos": "ADJ", "xpos": "A", "feats": "Case=Nom|Degree=Pos|Derivation=Llinen|Number=Sing", "head": 8, "deprel": "amod", "span": [63, 73], "lemma": "virallinen" },
    { "id": 8, "text": "liittymisseremonia", "upos": "NOUN", "xpos": "N", "feats": "Case=Nom|Number=Sing", "head": 4, "deprel": "obj", "span": [74, 92], "lemma": "liittyä#seremoni" },   <<< HERE
    { "id": 9, "text": ".", "upos": "PUNCT", "xpos": "Punct", "head": 4, "deprel": "punct", "span": [92, 93], "lemma": "." }
  ],
  "lang": "finnish"
}
I tried it both in Colab and in a terminal, but got the same results!
What am I doing wrong?
PS: I do not get the same error on the demo website:
Cheers,
This is not an error: the component words of compound words (Finnish: yhdyssana) are separated by the '#' sign by design.
However, this only occurs when the standard TDT package is used; the FTB package does not produce the '#' separators.
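If the '#' boundaries are unwanted downstream, one option is to strip them from the lemmas after parsing. The following is only a minimal sketch: `strip_compound_markers` is a hypothetical helper, not part of Trankit, and it assumes the sentence-level output is a dict with a "tokens" list whose items carry a "lemma" key, as in the output posted above. It does not touch Trankit itself, so the sample input here is hand-copied from that output.

```python
import copy


def strip_compound_markers(sentence, marker="#"):
    """Return a copy of a sentence dict with compound markers removed from lemmas.

    Hypothetical helper: assumes the {"tokens": [{"lemma": ...}, ...]} layout
    shown in the issue's output.
    """
    cleaned = copy.deepcopy(sentence)  # leave the caller's data untouched
    for token in cleaned.get("tokens", []):
        lemma = token.get("lemma")
        # Skip missing lemmas, and lemmas made only of the marker
        # (e.g. a literal '#' token), so they are not emptied.
        if lemma and lemma.strip(marker):
            token["lemma"] = lemma.replace(marker, "")
    return cleaned


# Hand-copied subset of the parsed sentence from the issue.
sample = {
    "tokens": [
        {"id": 2, "text": "päämajassa", "lemma": "pää#maja"},
        {"id": 5, "text": "iltapäivällä", "lemma": "ilta#päivä"},
        {"id": 9, "text": ".", "lemma": "."},
    ]
}

cleaned = strip_compound_markers(sample)
print([t["lemma"] for t in cleaned["tokens"]])
# prints ['päämaja', 'iltapäivä', '.']
```

Note that removing the marker discards the compound boundary information, so keep the original output if the component words are needed later.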