stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Apostrophe bug for Stanza English tokenizer model #1417

Closed sujoung closed 2 months ago

sujoung commented 3 months ago

Describe the bug It seems that some tokens containing an apostrophe cause misalignment in the word-level text.

To Reproduce

import stanza
nlp = stanza.Pipeline("en", processors='tokenize')

nlp("It is mainly the receptionist's responsibility to take hotel guests to the evacuation.")


[
  [
    {
      "id": 1,
      "text": "It",
      "start_char": 0,
      "end_char": 2
    },
    {
      "id": 2,
      "text": "is",
      "start_char": 3,
      "end_char": 5
    },
    {
      "id": 3,
      "text": "mainly",
      "start_char": 6,
      "end_char": 12
    },
    {
      "id": 4,
      "text": "the",
      "start_char": 13,
      "end_char": 16
    },
    {
      "id": [
        5,
        6
      ],
      "text": "receptionist's",
      "start_char": 17,
      "end_char": 31
    },
    {
      "id": 5,
      "text": "receptionstst" // weird text
    },
    {
      "id": 6,
      "text": "'s"
    },
    {
      "id": 7,
      "text": "responsibility",
      "start_char": 32,
      "end_char": 46
    },
    {
      "id": 8,
      "text": "to",
      "start_char": 47,
      "end_char": 49
    },
    {
      "id": 9,
      "text": "take",
      "start_char": 50,
      "end_char": 54
    },
    {
      "id": 10,
      "text": "hotel",
      "start_char": 55,
      "end_char": 60
    },
    {
      "id": 11,
      "text": "guests",
      "start_char": 61,
      "end_char": 67
    },
    {
      "id": 12,
      "text": "to",
      "start_char": 68,
      "end_char": 70
    },
    {
      "id": 13,
      "text": "the",
      "start_char": 71,
      "end_char": 74
    },
    {
      "id": 14,
      "text": "evacuation",
      "start_char": 75,
      "end_char": 85,
      "misc": "SpaceAfter=No"
    },
    {
      "id": 15,
      "text": ".",
      "start_char": 85,
      "end_char": 86,
      "misc": "SpaceAfter=No"
    }
  ]
]

I have collected a few more examples.

Expected behavior I think the receptionist's token should be split into receptionist AND 's at the word level, not receptionstst AND 's. As for the "I went to Björkängshallen's square." sentence, both wrong text and <UNK> appeared. But I would expect this:

{
      "id": 4,
      "text": "Björkängshallen" 
    },
    {
      "id": 5,
      "text": "'s"
    }

Environment (please complete the following information):

Additional context I saw this issue #1361 Maybe it is related to this?
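A quick way to flag this kind of misalignment in bulk is to compare each multi-word token's text with the concatenation of its word texts. The sketch below assumes the serialized shape shown in the output above (a range entry whose "id" is a two-element list, followed by plain word entries with integer ids). The function name `find_mwt_mismatches` is hypothetical, and the check only makes sense for languages where the split pieces are expected to spell out the token exactly, as with English possessives.

```python
from typing import Any, Dict, List, Tuple


def find_mwt_mismatches(sentence: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (token_text, joined_word_text) pairs that disagree.

    `sentence` is one sentence from the serialized output: a list of dicts
    where a multi-word token has "id" == [first_word_id, last_word_id] and
    each word has an integer "id". This shape is an assumption taken from
    the output pasted in this issue.
    """
    # Index the word-level entries by their integer id.
    words_by_id = {
        entry["id"]: entry["text"]
        for entry in sentence
        if isinstance(entry["id"], int)
    }

    mismatches = []
    for entry in sentence:
        ids = entry["id"]
        if isinstance(ids, int):
            continue  # plain word, nothing to compare against
        first, last = ids[0], ids[-1]
        joined = "".join(words_by_id.get(i, "") for i in range(first, last + 1))
        if joined != entry["text"]:
            mismatches.append((entry["text"], joined))
    return mismatches
```

Run on the sentence above, this would report the pair ("receptionist's", "receptionstst's").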

AngledLuffa commented 3 months ago

Which version are you using? I recall already fixing this on the dev branch. Basically I just need to get that version out there

On Wed, Sep 4, 2024, 2:53 PM Sujoung @.***> wrote:

I have collected a few more examples.

  • nlp("I went to Björkängshallen's square.")

{
  "id": [4, 5],
  "text": "Björkängshallen's",
  "start_char": 10,
  "end_char": 27
},
{
  "id": 4,
  "text": "Bjrkkngsshsn" // if no apostrophe, the token level seems to find correct text
},
{
  "id": 5,
  "text": "'s"
}

  • nlp("It happend at a Stockholm municipality's nursery")

{
  "id": [6, 7],
  "text": "municipality's",
  "start_char": 26,
  "end_char": 40
},
{
  "id": 6,
  "text": "municipaltity" // weird text again
},
{
  "id": 7,
  "text": "'s"
},

  • nlp("My establishment's meaning")

{
  "id": [2, 3],
  "text": "establishment's",
  "start_char": 3,
  "end_char": 18
},
{
  "id": 2,
  "text": "estabblismentn" // wrong word text
},
{
  "id": 3,
  "text": "'s"
},

  • nlp("We went to the Drottningholm's festival")

{
  "id": [5, 6],
  "text": "Drottningholm's",
  "start_char": 15,
  "end_char": 30
},
{
  "id": 5,
  "text": "Drottniingommm" // wrong word text
},
{
  "id": 6,
  "text": "'s"
}

  • nlp("The ad on the newspaper's pages")

{
  "id": [5, 6],
  "text": "newspaper's",
  "start_char": 14,
  "end_char": 25
},
{
  "id": 5,
  "text": "newspapper" // wrong word text
},
{
  "id": 6,
  "text": "'s"
}

Environment (please complete the following information):

  • OS: MacOS
  • Python version: 3.11
  • Stanza version: 1.8.2
  • English tokenize model : combined

Additional context I saw this issue #1361 https://github.com/stanfordnlp/stanza/issues/1361 Maybe it is related to this?

AngledLuffa commented 3 months ago

Ah, I can see some of your examples are still going wrong. At least I can hopefully fix the ... that just seems wrong

sujoung commented 3 months ago

Which version are you using? I recall already fixing this on the dev branch. Basically I just need to get that version out there

I am using 1.8.2 now, but nice to know that one is fixed already!

AngledLuffa commented 2 months ago

Well, some are still broken. In particular, it's super ugly that it 1) prints out <UNK> and 2) isn't able to copy unknown characters, even though there's a copy mechanism. Those are both relatively easy to fix. What's not so easy to fix is that it comes up with random hallucinations in some cases. It uses a seq2seq model to produce the word splits from the original text, and sometimes that model hallucinates in weird ways. I wonder if the correct mechanism for languages such as English is that it can only ever output the next character or a word split.

AngledLuffa commented 2 months ago

Ah, found a case with the lemmatizer where unk characters are also causing issues. There's a fallback to just use the original word if the lemmatizer tries to output <UNK> (which wasn't implemented in the MWT). So the following example gets mis-lemmatized as a result:

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,pos,lemma")
doc = pipe("Jennifer has nice antennae")
print([word.lemma for sentence in doc.sentences for word in sentence.words])
doc = pipe("Jennifer has nice ãntennae")
print([word.lemma for sentence in doc.sentences for word in sentence.words])

['Jennifer', 'have', 'nice', 'antenna']
['Jennifer', 'have', 'nice', 'ãntennae']

whereas the lemmatizer is trying to output <UNK>ntenna, which is what one might prefer if it knew how to properly fill in the <UNK>.

That makes this seem like a good opportunity to use the same process to temporarily add characters to the vocab for the lemmatizer, but at the same time I wonder if that would just lead to a tradeoff where it's hallucinating random garbage for some proper names even though it now occasionally gets the ending of plural nouns correct.
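The fallback described above (use the original word whenever the predicted lemma contains <UNK>) can be illustrated with a few lines. This is only a sketch of the behavior as described in this comment, not Stanza's actual code; the function name `lemma_with_fallback` and the `UNK` constant are hypothetical.

```python
UNK = "<UNK>"  # placeholder string as it appears in this thread


def lemma_with_fallback(word: str, predicted_lemma: str) -> str:
    """Return the predicted lemma, or the original word if the
    prediction contains an unknown-character placeholder."""
    if UNK in predicted_lemma:
        # The model could not reproduce some character, so the safest
        # output is the surface form itself.
        return word
    return predicted_lemma


print(lemma_with_fallback("antennae", "antenna"))
print(lemma_with_fallback("ãntennae", "<UNK>ntenna"))
```

The second call returns the unchanged surface form, which matches the mis-lemmatized output shown above: the fallback avoids printing <UNK>, at the cost of losing the otherwise correct plural handling.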

sujoung commented 2 months ago

Thanks for the investigation! I was running the tokenization and POS tagging steps over many data points (100K+ user texts) and parsing the serialized Stanza document into a flat data model with some index validation. That is how I found the apostrophe causing issues with tokenization. I haven't tested lemmatization or the higher-level processors with the bulk data generation yet, so maybe my apostrophe assumption is wrong. But at least in my application, I ran regexes over the apostrophe patterns: if pattern1 matches the previous token and pattern2 matches the next word's text, I fall back to pattern1's group(1). After that I didn't get any problems. The regex patterns I used are, for example:

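import re

# last_token is the multi-word token text, next_word_text is the text of the word after the garbled one
pattern1 = re.match(r"(\w+)(['’`′;])([a-zA-Z]*)$", last_token)
pattern2 = re.match(r"(['’`′;][\w]*)$", next_word_text)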

I don't think this should be the approach in the library code, both for performance reasons and because it is a hard-coded heuristic. I just wanted to point out that, in my observation, the random character shuffling seems to occur in the apostrophe cases.
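For anyone who wants to try the same workaround, the heuristic can be wrapped in a small function along these lines. The two regexes are the ones quoted above; the function name `repair_first_word` and its arguments are hypothetical, and this is a user-side patch on the output, not something Stanza provides.

```python
import re

# Patterns copied from the comment above.
TOKEN_PATTERN = re.compile(r"(\w+)(['’`′;])([a-zA-Z]*)$")
CLITIC_PATTERN = re.compile(r"(['’`′;][\w]*)$")


def repair_first_word(token_text: str, first_word_text: str, next_word_text: str) -> str:
    """Return a corrected text for the first word of a multi-word token.

    If the token looks like "<word><apostrophe><letters>" and the following
    word looks like an apostrophe clitic, fall back to the part of the token
    before the apostrophe. Otherwise keep the model's output unchanged.
    """
    token_match = TOKEN_PATTERN.match(token_text)
    clitic_match = CLITIC_PATTERN.match(next_word_text)
    if token_match and clitic_match:
        return token_match.group(1)
    return first_word_text


print(repair_first_word("receptionist's", "receptionstst", "'s"))
```

Because `\w` matches Unicode letters in Python 3, the same call also handles tokens such as "Björkängshallen's".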

AngledLuffa commented 2 months ago

Well, on our side, I think I have a model which will definitively fix the problem for English and other languages where the default MWT splitting is to split the pieces so they are exactly the MWT token. Instead of the seq2seq, it appears a classifier over the characters works just as well and leaves no ambiguity for the model to hallucinate. I will test it a bit more and then try to make a new release with that model.
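To illustrate why a per-character classifier leaves no room for hallucination, here is a minimal sketch. It assumes the classifier emits one boolean per character meaning "a new word starts here"; the pieces are then cut directly from the original token, so their concatenation always equals the input. Both function names are hypothetical, and the rule-based `toy_apostrophe_labels` is only a stand-in for a trained model, not the model mentioned above.

```python
from typing import List


def split_with_char_labels(token: str, starts_new_word: List[bool]) -> List[str]:
    """Cut `token` into pieces using one boolean label per character."""
    if len(starts_new_word) != len(token):
        raise ValueError("need exactly one label per character")
    pieces: List[str] = []
    current = ""
    for ch, is_start in zip(token, starts_new_word):
        if is_start and current:
            pieces.append(current)
            current = ""
        current += ch  # characters are copied, never generated
    if current:
        pieces.append(current)
    return pieces


def toy_apostrophe_labels(token: str) -> List[bool]:
    """Stand-in for a trained classifier: split before a non-initial apostrophe."""
    return [i > 0 and ch in "'’" for i, ch in enumerate(token)]


token = "Björkängshallen's"
pieces = split_with_char_labels(token, toy_apostrophe_labels(token))
assert "".join(pieces) == token  # holds for any labels, by construction
print(pieces)
```

Whatever labels the classifier predicts, the worst case is a split in the wrong place; it cannot produce characters that were not in the input.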