stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Hebrew parser ends with punct if there is no "." #471

Closed KoichiYasuoka closed 3 years ago

KoichiYasuoka commented 4 years ago
>>> import stanza
>>> nlp=stanza.Pipeline("he")
>>> doc=nlp("על טעם וריח אין להתווכח")
>>> print(doc)
[
  [
    {
      "id": 1,
      "text": "על",
      "lemma": "על",
      "upos": "ADP",
      "xpos": "ADP",
      "head": 2,
      "deprel": "case",
      "misc": "start_char=0|end_char=2"
    },
    {
      "id": 2,
      "text": "טעם",
      "lemma": "טעם",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Sing",
      "head": 5,
      "deprel": "obl",
      "misc": "start_char=3|end_char=6"
    },
    {
      "id": [
        3,
        4
      ],
      "text": "וריח",
      "misc": "start_char=7|end_char=11"
    },
    {
      "id": 3,
      "text": "ו",
      "lemma": "ו",
      "upos": "CCONJ",
      "xpos": "CCONJ",
      "head": 4,
      "deprel": "cc"
    },
    {
      "id": 4,
      "text": "ריח",
      "lemma": "ריח",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Sing",
      "head": 2,
      "deprel": "conj"
    },
    {
      "id": 5,
      "text": "אין",
      "lemma": "אין",
      "upos": "AUX",
      "xpos": "AUX",
      "feats": "VerbType=Mod",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=12|end_char=15"
    },
    {
      "id": 6,
      "text": "להתווכח",
      "lemma": "התווכח",
      "upos": "VERB",
      "xpos": "VERB",
      "feats": "HebBinyan=HITPAEL|VerbForm=Inf",
      "head": 5,
      "deprel": "punct",
      "misc": "start_char=16|end_char=23"
    }
  ]
]

The 6th word ("to argue") is parsed as punct.

yuhui-zh15 commented 4 years ago

Thank you for reporting this. We can reproduce the error. Could you try more examples and see whether it always behaves like that? If this is the only case, it might be a statistical error (https://stanfordnlp.github.io/stanza/faq.html#model-predictions-are-wrong-on-some-of-my-examples-is-this-normal).

KoichiYasuoka commented 4 years ago
>>> import stanza
>>> nlp=stanza.Pipeline("he")
>>> doc=nlp("לטעם ולצבע אין שותפים")
>>> print(doc)
[
  [
    {
      "id": [
        1,
        2
      ],
      "text": "לטעם",
      "misc": "start_char=0|end_char=4"
    },
    {
      "id": 1,
      "text": "ל",
      "lemma": "ל",
      "upos": "ADP",
      "xpos": "ADP",
      "head": 2,
      "deprel": "case"
    },
    {
      "id": 2,
      "text": "טעם",
      "lemma": "טעם",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Sing",
      "head": 5,
      "deprel": "obl"
    },
    {
      "id": [
        3,
        4
      ],
      "text": "ולצבע",
      "misc": "start_char=5|end_char=10"
    },
    {
      "id": 3,
      "text": "ו",
      "lemma": "ו",
      "upos": "CCONJ",
      "xpos": "CCONJ",
      "head": 4,
      "deprel": "cc"
    },
    {
      "id": 4,
      "text": "לצבע",
      "lemma": "לצבע",
      "upos": "ADV",
      "xpos": "ADV",
      "head": 2,
      "deprel": "conj"
    },
    {
      "id": 5,
      "text": "אין",
      "lemma": "אין",
      "upos": "VERB",
      "xpos": "VERB",
      "feats": "HebExistential=True",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=11|end_char=14"
    },
    {
      "id": 6,
      "text": "שותפים",
      "lemma": "שותף",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Plur",
      "head": 5,
      "deprel": "punct",
      "misc": "start_char=15|end_char=21"
    }
  ]
]

The 6th word "partners" is parsed as punct.

KoichiYasuoka commented 4 years ago
>>> import stanza
>>> nlp=stanza.Pipeline("he")
>>> doc=nlp("לא נתווכח כלל כי לנו לא אכפת")
>>> print(doc)
[
  [
    {
      "id": 1,
      "text": "לא",
      "lemma": "לא",
      "upos": "ADV",
      "xpos": "ADV",
      "feats": "Polarity=Neg",
      "head": 2,
      "deprel": "advmod",
      "misc": "start_char=0|end_char=2"
    },
    {
      "id": 2,
      "text": "נתווכח",
      "lemma": "התווכח",
      "upos": "VERB",
      "xpos": "VERB",
      "feats": "Gender=Fem,Masc|HebBinyan=HITPAEL|Number=Plur|Person=1|Tense=Fut",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=3|end_char=9"
    },
    {
      "id": 3,
      "text": "כלל",
      "lemma": "כלל",
      "upos": "ADV",
      "xpos": "ADV",
      "head": 2,
      "deprel": "advmod",
      "misc": "start_char=10|end_char=13"
    },
    {
      "id": 4,
      "text": "כי",
      "lemma": "כי",
      "upos": "SCONJ",
      "xpos": "SCONJ",
      "head": 6,
      "deprel": "mark",
      "misc": "start_char=14|end_char=16"
    },
    {
      "id": [
        5,
        6
      ],
      "text": "לנו",
      "misc": "start_char=17|end_char=20"
    },
    {
      "id": 5,
      "text": "ל_",
      "lemma": "ל",
      "upos": "ADP",
      "xpos": "ADP",
      "head": 6,
      "deprel": "case"
    },
    {
      "id": 6,
      "text": "_אנחנו",
      "lemma": "הוא",
      "upos": "PRON",
      "xpos": "PRON",
      "feats": "Gender=Fem,Masc|Number=Plur|Person=1|PronType=Prs",
      "head": 2,
      "deprel": "ccomp"
    },
    {
      "id": 7,
      "text": "לא",
      "lemma": "לא",
      "upos": "ADV",
      "xpos": "ADV",
      "feats": "Polarity=Neg",
      "head": 2,
      "deprel": "advmod",
      "misc": "start_char=21|end_char=23"
    },
    {
      "id": 8,
      "text": "אכפת",
      "lemma": "אכפת",
      "upos": "PUNCT",
      "xpos": "PUNCT",
      "head": 2,
      "deprel": "punct",
      "misc": "start_char=24|end_char=28"
    }
  ]
]

The 8th word "concern" is tagged as PUNCT and parsed as punct.
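The pattern in the three reports above can be checked mechanically. The following is a minimal sketch, not part of Stanza: `final_word_mislabelled` is a hypothetical helper that only assumes each word exposes `text` and `deprel` attributes, as Stanza's `Word` objects do. The demo uses stand-in objects so it runs without downloading any model.

```python
from types import SimpleNamespace


def final_word_mislabelled(words):
    """Return True if the last word has deprel 'punct' but contains letters or digits."""
    if not words:
        return False
    last = words[-1]
    # Genuine punctuation tokens contain no alphanumeric characters.
    has_letters = any(ch.isalnum() for ch in last.text)
    return last.deprel == "punct" and has_letters


# Stand-ins for the last two words of the second report (hypothetical objects,
# not real Stanza output).
demo = [
    SimpleNamespace(text="אין", deprel="root"),
    SimpleNamespace(text="שותפים", deprel="punct"),
]
print(final_word_mislabelled(demo))
```

With a real pipeline, the same check would presumably be called as `final_word_mislabelled(doc.sentences[0].words)`.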

yuhui-zh15 commented 4 years ago

Thanks for reporting this. I also entered some random Hebrew words and can confirm that this is the case. The issue might be fixed in a future release, either by trying some data augmentation methods or by finding more Hebrew data and pooling it together (which will hopefully cover more cases like these). However, we cannot promise when that will happen. For now, we suggest working around the problem manually by adding a processor that appends punctuation to the end of the sentence. Also, if you know of any additional Hebrew resources that we can leverage, please feel free to let us know!
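One way to approximate the suggested workaround is to append a period to the raw text before it reaches the pipeline. This is only a sketch of a pre-processing step, not a Stanza processor; `ensure_final_punct` and the `SENTENCE_FINAL` tuple are hypothetical names, and the set of marks treated as sentence-final is an assumption.

```python
# Marks treated as sentence-final (an assumption; extend as needed).
SENTENCE_FINAL = (".", "!", "?", "…")


def ensure_final_punct(text, mark="."):
    """Append `mark` unless the text already ends with sentence-final punctuation."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(SENTENCE_FINAL):
        return stripped
    return stripped + mark


sentence = "על טעם וריח אין להתווכח"
print(ensure_final_punct(sentence))
```

The result would then be passed to the pipeline, for example `nlp(ensure_final_punct(sentence))`. Note that the parse will contain one extra token for the appended period, which downstream code may need to drop.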

AngledLuffa commented 3 years ago

Would you try downloading the linked models and replacing the originals?

They should avoid the problem of always ending a sentence with punct, at the cost of a tiny bit of LAS (labeled attachment score).

AngledLuffa commented 3 years ago

http://nlp.stanford.edu/~horatio/htb_depparse.pt : parser, should go in ~/stanza_resources/he/depparse/htb.pt

http://nlp.stanford.edu/~horatio/htb_pos.pt : tagger, should go in ~/stanza_resources/he/pos/htb.pt
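The replacement described above could be scripted roughly as follows. This is a sketch under assumptions: the two URLs and destination paths are the ones given in this comment (they may no longer be available), while `plan_downloads`, `install_models`, and the `.orig` backup naming are hypothetical. The download call is left commented out so the block has no side effects when run.

```python
import urllib.request
from pathlib import Path

# URL -> destination relative to the Stanza resources directory (from the comment above).
MODELS = {
    "http://nlp.stanford.edu/~horatio/htb_depparse.pt": Path("he/depparse/htb.pt"),
    "http://nlp.stanford.edu/~horatio/htb_pos.pt": Path("he/pos/htb.pt"),
}


def plan_downloads(resources_dir="~/stanza_resources"):
    """Return (url, destination) pairs without touching the network or disk."""
    root = Path(resources_dir).expanduser()
    return [(url, root / rel) for url, rel in MODELS.items()]


def install_models(resources_dir="~/stanza_resources", backup=True):
    """Download each model, optionally keeping the original as '<name>.orig'."""
    for url, dest in plan_downloads(resources_dir):
        dest.parent.mkdir(parents=True, exist_ok=True)
        if backup and dest.exists():
            dest.replace(dest.with_name(dest.name + ".orig"))
        urllib.request.urlretrieve(url, str(dest))


for url, dest in plan_downloads():
    print(url, "->", dest)

# install_models()  # uncomment to actually download and replace the models
```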

KoichiYasuoka commented 3 years ago

Thank you @AngledLuffa , I've replaced he/depparse/htb.pt and he/pos/htb.pt:

>>> import stanza
>>> nlp=stanza.Pipeline("he")
>>> doc=nlp("על טעם וריח אין להתווכח")
>>> print(doc)
[
  [
    {
      "id": 1,
      "text": "על",
      "lemma": "על",
      "upos": "ADP",
      "xpos": "ADP",
      "head": 2,
      "deprel": "case",
      "misc": "start_char=0|end_char=2"
    },
    {
      "id": 2,
      "text": "טעם",
      "lemma": "טעם",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Sing",
      "head": 5,
      "deprel": "obl",
      "misc": "start_char=3|end_char=6"
    },
    {
      "id": [
        3,
        4
      ],
      "text": "וריח",
      "misc": "start_char=7|end_char=11"
    },
    {
      "id": 3,
      "text": "ו",
      "lemma": "ו",
      "upos": "CCONJ",
      "xpos": "CCONJ",
      "head": 4,
      "deprel": "cc"
    },
    {
      "id": 4,
      "text": "ריח",
      "lemma": "ריח",
      "upos": "NOUN",
      "xpos": "NOUN",
      "feats": "Gender=Masc|Number=Sing",
      "head": 2,
      "deprel": "conj"
    },
    {
      "id": 5,
      "text": "אין",
      "lemma": "אין",
      "upos": "AUX",
      "xpos": "AUX",
      "feats": "VerbType=Mod",
      "head": 6,
      "deprel": "aux",
      "misc": "start_char=12|end_char=15"
    },
    {
      "id": 6,
      "text": "להתווכח",
      "lemma": "התווכח",
      "upos": "VERB",
      "xpos": "VERB",
      "feats": "HebBinyan=HITPAEL|VerbForm=Inf",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=16|end_char=23"
    }
  ]
]

Yes, this is a very good result. Thank you again. How did you do that?

AngledLuffa commented 3 years ago

Good to hear. What I did was add "data augmentation" to the training of two models for two languages, which is a fancy way of saying I removed the final punctuation from 10% of the sentences.
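The idea can be illustrated with a short sketch. This is not Stanza's training code: `strip_final_punct` is a hypothetical function, sentences are represented as lists of token dicts with `id`, `upos`, and `head` keys, and whether the real augmentation replaces sentences or adds modified copies is not stated here, so the sketch simply replaces them.

```python
import random


def strip_final_punct(sentences, rate=0.1, seed=0):
    """Drop the final PUNCT token from roughly `rate` of the sentences."""
    rng = random.Random(seed)
    augmented = []
    for sent in sentences:
        last = sent[-1] if sent else None
        removable = (
            len(sent) > 1
            and last["upos"] == "PUNCT"
            # Only strip when no other token depends on the final punctuation.
            and all(tok["head"] != last["id"] for tok in sent[:-1])
        )
        if removable and rng.random() < rate:
            augmented.append(sent[:-1])
        else:
            augmented.append(sent)
    return augmented


# Toy data: 1000 copies of a two-token sentence ending in punctuation.
toy = [
    [{"id": 1, "upos": "NOUN", "head": 0}, {"id": 2, "upos": "PUNCT", "head": 1}]
    for _ in range(1000)
]
stripped = sum(len(s) == 1 for s in strip_final_punct(toy))
print(f"{stripped} of {len(toy)} sentences lost their final punctuation")
```

Because only the last token is removed and nothing depends on it, the ids and heads of the remaining tokens stay valid without renumbering.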