nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

lemma with # sign in Finnish language #70

Open mrgransky opened 1 year ago

mrgransky commented 1 year ago

Given the following code snippet:

import json
from trankit import Pipeline

p = Pipeline('auto', embedding='xlm-roberta-large')

doc = '''Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.'''

tokens = p(doc, is_sent=True)
print(json.dumps(tokens, indent=2, ensure_ascii=False))

For some reason, I get # in my lemma as seen in this sample doc:

{
  "text": "Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.",
  "tokens": [
    {
      "id": 1,
      "text": "Naton",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 2,
      "deprel": "nmod:poss",
      "span": [
        0,
        5
      ],
      "lemma": "Nato"
    },
    {
      "id": 2,
      "text": "päämajassa",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        6,
        16
      ],
      "lemma": "pää#maja"  <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 3,
      "text": "Brysselissä",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 2,
      "deprel": "appos",
      "span": [
        17,
        28
      ],
      "lemma": "Bryssel"
    },
    {
      "id": 4,
      "text": "järjestettiin",
      "upos": "VERB",
      "xpos": "V",
      "feats": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass",
      "head": 0,
      "deprel": "root",
      "span": [
        29,
        42
      ],
      "lemma": "järjestää"
    },
    {
      "id": 5,
      "text": "iltapäivällä",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ade|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        43,
        55
      ],
      "lemma": "ilta#päivä" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 6,
      "text": "Suomen",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 8,
      "deprel": "nmod:poss",
      "span": [
        56,
        62
      ],
      "lemma": "Suomi"
    },
    {
      "id": 7,
      "text": "virallinen",
      "upos": "ADJ",
      "xpos": "A",
      "feats": "Case=Nom|Degree=Pos|Derivation=Llinen|Number=Sing",
      "head": 8,
      "deprel": "amod",
      "span": [
        63,
        73
      ],
      "lemma": "virallinen"
    },
    {
      "id": 8,
      "text": "liittymisseremonia",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Nom|Number=Sing",
      "head": 4,
      "deprel": "obj",
      "span": [
        74,
        92
      ],
      "lemma": "liittyä#seremoni" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "Punct",
      "head": 4,
      "deprel": "punct",
      "span": [
        92,
        93
      ],
      "lemma": "."
    }
  ],
  "lang": "finnish"
}

I tried it both in Colab and in a terminal, with the same results!

What am I doing wrong?

PS: I do not get the same result on the demo website (screenshot attached).

Cheers,

OttoTarkka commented 1 year ago

This is not an error: the component words of a compound word (Finnish: yhdyssana) are separated by the '#' sign by design.
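
If the '#' compound boundary is not wanted downstream, one option is to post-process the output. The following is only a minimal sketch, assuming the sentence dictionary has the shape shown in the issue (a "tokens" list whose items may carry a "lemma" string). The helper names `clean_lemma`, `split_compound` and `strip_compound_markers` are hypothetical and not part of Trankit.

```python
def clean_lemma(lemma, marker="#"):
    """Remove the compound marker, but keep lemmas that consist only of it."""
    stripped = lemma.replace(marker, "")
    # A literal "#" token would otherwise end up with an empty lemma.
    return stripped if stripped else lemma


def split_compound(lemma, marker="#"):
    """Return the component lemmas of a compound, e.g. for indexing each part."""
    return [part for part in lemma.split(marker) if part]


def strip_compound_markers(sentence, marker="#"):
    """Return a copy of a sentence dict with markers removed from every lemma.

    Assumes the dict layout shown in the issue; the input is not modified.
    """
    cleaned = dict(sentence)
    cleaned["tokens"] = [
        {**tok, "lemma": clean_lemma(tok["lemma"], marker)} if "lemma" in tok else dict(tok)
        for tok in sentence.get("tokens", [])
    ]
    return cleaned


# Hand-written sample mirroring part of the output above (not a live Trankit call).
sample = {
    "text": "Naton päämajassa #",
    "tokens": [
        {"id": 1, "text": "Naton", "lemma": "Nato"},
        {"id": 2, "text": "päämajassa", "lemma": "pää#maja"},
        {"id": 3, "text": "#", "lemma": "#"},
    ],
}

print([tok["lemma"] for tok in strip_compound_markers(sample)["tokens"]])
print(split_compound("ilta#päivä"))
```

Stripping the marker loses the information about where the compound boundary is, so `split_compound` may be the better choice when the parts themselves are useful.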

mrgransky commented 6 months ago

But this only occurs when the standard TDT package is used; the FTB package does not produce the '#' separator.