stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Stanza misses multi-word token id links for "dunno" #1377

Closed khannan-livefront closed 3 months ago

khannan-livefront commented 3 months ago

Describe the bug With the introduction of multi-word tokens (MWT) for English, we came across a test case where the words that make up a multi-word token are not all linked to by the token's ids.

To Reproduce Steps to reproduce the behavior:

  1. Run the sentence:
    I dunno.
  2. Check the Universal Dependencies output. The entries for "dunno" show that one of the words is not linked to by its multi-word token:

[

// multi-word token here for "dunno"
// id only links to 2 and 4, missing 3  
  {
    "end_char": 7,
    "id": [
      2,
      4
    ],
    "misc": "SpaceAfter=No",
    "start_char": 2,
    "text": "dunno"
  },

// "du" is linked by this multi-word token
  {
    "deprel": "root",
    "end_char": 4,
    "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    "head": 0,
    "id": 2,
    "lemma": "do",
    "start_char": 2,
    "text": "du",
    "upos": "AUX",
    "xpos": "VBP"
  },

  // "n" is not linked by the multi-word token
  {
    "deprel": "advmod",
    "end_char": 5,
    "head": 2,
    "id": 3,
    "lemma": "not",
    "start_char": 4,
    "text": "n",
    "upos": "PART",
    "xpos": "RB"
  },

  // "no" is linked by multi-word token
  {
    "deprel": "discourse",
    "end_char": 7,
    "head": 2,
    "id": 4,
    "lemma": "no",
    "start_char": 5,
    "text": "no",
    "upos": "INTJ",
    "xpos": "UH"
  }
]

Expected behavior The MWT token links to all of the children tokens it encompasses. id: [2, 3, 4]

Additional context I'm not sure if this behaviour is intended or not. Are the IDs of the MWT token intended to act as a tuple, i.e. a range, or should they include every token that's a member of the multi-word token? If it's the latter then I believe this is a bug.

AngledLuffa commented 3 months ago

If I run the following, I get all three Words, so I think it is behaving as expected. The UD annotation standard is to mark the start and end points of the range (inclusive), so an id of [2, 4] covers words 2, 3, and 4. Is there something else you observed that needs to be fixed?

pipe("I dunno where it went").sentences[0].tokens[1]

[
  {
    "id": [
      2,
      4
    ],
    "text": "dunno",
    "start_char": 2,
    "end_char": 7,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  },
  {
    "id": 2,
    "text": "du",
    "lemma": "do",
    "upos": "AUX",
    "xpos": "VBP",
    "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    "head": 0,
    "deprel": "root",
    "start_char": 2,
    "end_char": 4
  },
  {
    "id": 3,
    "text": "n",
    "lemma": "not",
    "upos": "PART",
    "xpos": "RB",
    "head": 2,
    "deprel": "advmod",
    "start_char": 4,
    "end_char": 5
  },
  {
    "id": 4,
    "text": "no",
    "lemma": "no",
    "upos": "INTJ",
    "xpos": "UH",
    "head": 2,
    "deprel": "discourse",
    "start_char": 5,
    "end_char": 7
  }
]
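The inclusive start/end convention can be sketched as a small helper that expands a token's "id" field into every word id it covers. This is only an illustration on plain Python values shaped like the output above; `expand_token_id` is a hypothetical name, not part of Stanza, and accepting a one-element id is an assumption made for robustness.

```python
from typing import List, Sequence, Union


def expand_token_id(token_id: Union[int, Sequence[int]]) -> List[int]:
    """Return every word id covered by a token's "id" field.

    Hypothetical helper (not a Stanza API). Assumes the id is either a
    plain int, a one-element sequence, or an inclusive [start, end] pair.
    """
    if isinstance(token_id, int):
        return [token_id]
    if len(token_id) == 1:
        return [token_id[0]]
    start, end = token_id
    # The end point is inclusive, so add 1 for Python's half-open range.
    return list(range(start, end + 1))


print(expand_token_id([2, 4]))  # [2, 3, 4]
print(expand_token_id(2))       # [2]
```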

khannan-livefront commented 3 months ago

Thanks for clarifying @AngledLuffa. It's fine if it works that way. I had coded my implementation to assume that the multi-word token would list the id of every word it contains, but it appears that this assumption is wrong. I've updated my implementation to treat the ids of the multi-word token as an inclusive min/max range.

Thanks for the prompt response!
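
For readers consuming the same flat list of entries, one way to apply the min/max reading is sketched below. It is not the implementation referred to in this thread, only a guess at the general shape: `group_words` is a hypothetical function, and the sample data is a trimmed version of the output above in which the entries for "I" and "." are assumed rather than copied from the thread.

```python
from typing import Any, Dict, List


def group_words(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group a flat list of token/word dicts by surface token.

    Hypothetical sketch: an entry whose "id" is a two-element list is
    treated as a multi-word token with an inclusive [start, end] range,
    and the following words whose ids fall in that range are attached to it.
    """
    groups: List[Dict[str, Any]] = []
    current = None  # the most recent multi-word token group, if any
    for entry in entries:
        entry_id = entry["id"]
        if isinstance(entry_id, (list, tuple)) and len(entry_id) == 2:
            current = {"text": entry["text"],
                       "span": (entry_id[0], entry_id[1]),
                       "words": []}
            groups.append(current)
            continue
        word_id = entry_id[0] if isinstance(entry_id, (list, tuple)) else entry_id
        if current is not None and current["span"][0] <= word_id <= current["span"][1]:
            # Inside the inclusive range: this word belongs to the MWT.
            current["words"].append(entry)
        else:
            # Ordinary single-word token: it forms its own group.
            current = None
            groups.append({"text": entry["text"],
                           "span": (word_id, word_id),
                           "words": [entry]})
    return groups


sample = [
    {"id": 1, "text": "I"},          # assumed entry, not shown in the thread
    {"id": [2, 4], "text": "dunno"},
    {"id": 2, "text": "du"},
    {"id": 3, "text": "n"},
    {"id": 4, "text": "no"},
    {"id": 5, "text": "."},          # assumed entry, not shown in the thread
]

for group in group_words(sample):
    print(group["text"], [word["text"] for word in group["words"]])
```

With this reading, "n" (id 3) is attached to "dunno" even though 3 never appears in the token's id list.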