stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Stanza 1.7.0+ makes breaking API changes for possessives, tokens excluding `end_char` and `start_char` fields #1361

Open · khannan-livefront opened 7 months ago

khannan-livefront commented 7 months ago

Describe the bug
I'm updating Stanza from 1.6.1 to 1.7.x / 1.8.x and noticed a number of breaking API changes in the Stanza Token result when handling possessives.

To Reproduce

  1. Send Stanza a sentence containing a possessive apostrophe, such as `Joe's dog.`
  2. Look at the Universal Dependencies.

Stanza now includes an additional token, which I'll call an "aggregate token", with the text field Joe's. This aggregate token comes in addition to the tokens for Joe and 's, and its id field holds a list pointing to the other two tokens:

// the new aggregate token appearing for each possessive apostrophe
    {
      "end_char": 5,
      "id": [
        1,
        2
      ],
      "start_char": 0,
      "text": "Joe's"
    },
 // child tokens missing start_char and end_char fields
    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 3,
      "id": 1,
      "lemma": "Joe",
      "text": "Joe",
      "upos": "PROPN",
      "xpos": "NNP"
    },
    {
      "deprel": "case",
      "head": 1,
      "id": 2,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },
    // normal tokens
    {
      "deprel": "root",
      "end_char": 9,
      "feats": "Number=Sing",
      "head": 0,
      "id": 3,
      "lemma": "dog",
      "start_char": 6,
      "text": "dog",
      "upos": "NOUN",
      "xpos": "NN"
    },
    {
      "deprel": "punct",
      "end_char": 10,
      "head": 3,
      "id": 4,
      "lemma": ".",
      "start_char": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "."
    }

This breaks the one-to-one mapping that used to exist between tokens and word elements within the s-expression returned by the constituency tree:

(ROOT (NP (NP (NNP Joe) (POS 's)) (NN dog) (. .)))

But more problematically, this new aggregate token is now the only token containing the end_char and start_char data about the word.

In addition to being a breaking change, this new approach is quite hard for application developers to work with. To parse the output they need to chase down the ID links of the aggregate token whenever it appears in order to map its linguistic data. Moreover, important character information is lost: where the boundary falls between a word and its apostrophe.

Expected behavior
For a possessive like `Joe's dog.`, Stanza returns four dependency tokens, as it did in Stanza 1.6.1:

  {
    "deprel": "nmod:poss",
    "end_char": 3,
    "feats": "Number=Sing",
    "head": 3,
    "id": 1,
    "lemma": "Joe",
    "start_char": 0,
    "text": "Joe",
    "upos": "PROPN",
    "xpos": "NNP"
  },
  {
    "deprel": "case",
    "end_char": 5,
    "head": 1,
    "id": 2,
    "lemma": "'s",
    "start_char": 3,
    "text": "'s",
    "upos": "PART",
    "xpos": "POS"
  },
  {
    "deprel": "root",
    "end_char": 9,
    "feats": "Number=Sing",
    "head": 0,
    "id": 3,
    "lemma": "dog",
    "start_char": 6,
    "text": "dog",
    "upos": "NOUN",
    "xpos": "NN"
  },
  {
    "deprel": "punct",
    "end_char": 10,
    "head": 3,
    "id": 4,
    "lemma": ".",
    "start_char": 9,
    "text": ".",
    "upos": "PUNCT",
    "xpos": "."
  }

Or, if a fifth aggregate token with an array of ids continues to be returned, the non-aggregate child tokens should at least retain their own end_char and start_char information as before. This would allow developers to ignore the aggregate tokens while preserving the character boundaries between tokens.
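To illustrate the kind of check this would let us write, here is a minimal sketch (assuming the JSON shape shown above, where an aggregate token's id is a list and a word-level token's id is a single integer; drop_aggregate_tokens is just an illustrative name, not anything in Stanza):

def drop_aggregate_tokens(sentence_json):
    """Keep only word-level entries from one sentence of Stanza's JSON
    output, skipping aggregate MWT entries whose "id" is a list."""
    return [tok for tok in sentence_json if not isinstance(tok["id"], list)]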


AngledLuffa commented 7 months ago

As you have surmised, this was an intentional breaking change. English was actually handled differently from almost every other language where multiple syntactic "words" are written as a single "token". In general they are labeled as MWT (multi-word tokens), such as in Spanish, where the direct and indirect object pronouns can be attached to certain forms of verbs. In the case of English, there are a few classes of words which fit that category:

  * possessives
  * contractions: can't, won't, ...
  * contractions which don't even have ': gonna, wanna, cannot, ...

So the first thing you can do is run your processing on a sentence's words instead of its tokens, such as:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,mwt")
>>> doc = pipe("This change is gonna annoy people")
>>> doc.sentences[0].words[3]  # 0-indexed, so this is the word with id 4

{
  "id": 4,
  "text": "gon"
}

If you're using the json output format, the MWTs are always marked by an id holding more than one value:

>>> doc.sentences[0].tokens[3]
[
  {
    "id": [
      4,
      5
    ],
    "text": "gonna",
    "start_char": 15,
    "end_char": 20
  },
  {
    "id": 4,
    "text": "gon"
  },
  {
    "id": 5,
    "text": "na"
  }
]

As you point out, this is missing the character positions on the words. This is because, in some languages, the tokenization standard is to rewrite the word pieces to match the actual word, so we'd have going to instead of gon na as the text of those words. Still, I can see how it would be useful to put start_char and end_char on the words if the word pieces happen to add up to the MWT, so I can make that a TODO. In the English datasets, the standard is to split the original text into pieces which correspond to the actual text rather than rewriting.
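The idea behind that TODO would be roughly the following, which could also serve as an application-side workaround in the meantime (a sketch over the JSON shape shown above; word_offsets is an illustrative name, not Stanza API):

def word_offsets(token, words):
    """Assign start_char/end_char to the word pieces of an MWT token,
    but only when the pieces concatenate exactly to the token text
    (true for English-style splitting such as "gon" + "na" == "gonna",
    false in languages that rewrite pieces, e.g. "gon na" -> "going to")."""
    if "".join(w["text"] for w in words) != token["text"]:
        return  # pieces were rewritten; offsets can't be recovered
    pos = token["start_char"]
    for w in words:
        w["start_char"] = pos
        w["end_char"] = pos + len(w["text"])
        pos = w["end_char"]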

There's another really annoying issue, which is that the NER training data can use either MWTs or words (generally words), whereas for some reason the NER processor uses the MWTs instead of the words. As a result, it doesn't always correctly label possessives. I should mark that as another TODO.

If there are other items which would make this more compatible with your previous workflow, please let us know.

khannan-livefront commented 7 months ago

Thank you for your prompt response @AngledLuffa! Adding start_char and end_char back to the original tokens would be very helpful for us. This would allow us to skip the processing of MWT tokens.

Since you asked: our biggest architectural dependency on Stanza is that we rely on a one-to-one mapping between Universal Dependencies tokens and leaf nodes of the constituency parse. These two inputs are mapped as linked objects in our system. For example, currently the word can't maps to two tokens:

can not

and in the constituency tree, to two leaf nodes:

(MD ca) (RB n't)

and this one-to-one relationship is linked in our system as objects: a given token can fetch its constituent node, and vice versa. So if "can't" becomes one MWT token, our system would break unless the constituency tree also maps "can't" to a single leaf node.

Thankfully, it looks like I can still maintain this relationship by skipping over MWT tokens, since the original tokens still have this one-to-one mapping in Stanza 1.8.1. Returning the start_char and end_char fields to the original tokens would be all that's needed to give us a smooth upgrade path.
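For what it's worth, the alignment we rely on looks roughly like this (a sketch, assuming a pipeline that includes the constituency processor, and assuming tree nodes expose label and children; the helper names are ours, not Stanza's):

def leaves(tree):
    """Collect the leaf nodes of a constituency tree, in order."""
    if not tree.children:
        return [tree]
    return [leaf for child in tree.children for leaf in leaves(child)]

def align_words_to_leaves(sentence):
    """Pair each dependency word with its constituency leaf, relying on
    the one-to-one, in-order correspondence between words and leaves."""
    leaf_nodes = leaves(sentence.constituency)
    assert len(sentence.words) == len(leaf_nodes)
    return list(zip(sentence.words, leaf_nodes))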

Thanks again for the prompt response!

AngledLuffa commented 6 months ago

Alright, I separated the Word start & end chars for situations where the pieces add up to the surrounding Token (the Token, again, being the MWT representation). I should emphasize that there may be cases in English where the pieces do not add up to the full Token, in which case there won't be a start & end char. If'n you come across those and it isn't properly tokenizing them, we can take a look. The change is currently in the dev branch.

khannan-livefront commented 6 months ago

@AngledLuffa Just built a version of Stanza from the dev branch with the latest changes, and I can see that the start_char and end_char are back. Our API integration is working again when I add a small check to exclude MWT tokens. Thank you so much @AngledLuffa!! ✨ 🤩

khannan-livefront commented 6 months ago

@AngledLuffa Do you know when the next release of Stanza will be, so I can leapfrog to that one?

AngledLuffa commented 6 months ago

Depends on if any show-stopping bugs show up, I suppose. Probably a couple months if nothing critical comes up

AngledLuffa commented 6 months ago

I'd actually prefer to leave this open until I figure out what to do with the NER tags, btw

AngledLuffa commented 5 months ago

This is now part of the 1.8.2 release

khannan-livefront commented 5 months ago

Thanks @AngledLuffa, we have now migrated to Stanza 1.8.2!