stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Stanza resolves wrong text for tokens in a multi-word token #1371

Closed khannan-livefront closed 5 months ago

khannan-livefront commented 6 months ago

Describe the bug Following the changes to multi-word tokens in #1361, I encountered an error in how Stanza generates tokens for words with apostrophes, particularly contractions.

To Reproduce Steps to reproduce the behavior:

  1. Run the sentence:
    The schoolmaster's wife started a sewing class.
  2. Check the Universal Dependencies output; in particular, the tokens for schoolmaster's reveal the incorrect base word schoolmaterr:

// MWT token is correct:

    {
      "end_char": 18,
      "id": [
        2,
        3
      ],
      "start_char": 4,
      "text": "schoolmaster's"
    },

// non-MWT text resolves incorrectly to "schoolmaterr":

    {
      "deprel": "nmod:poss",
      "feats": "Number=Sing",
      "head": 4,
      "id": 2,
      "lemma": "schoolmaterr",
      "text": "schoolmaterr",
      "upos": "NOUN",
      "xpos": "NN"
    },

 // correct: 

    {
      "deprel": "case",
      "head": 2,
      "id": 3,
      "lemma": "'s",
      "text": "'s",
      "upos": "PART",
      "xpos": "POS"
    },
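For reference, a minimal script that produces the output above; the exact processor list below is my assumption of a standard English pipeline:

    import stanza

    # Standard English pipeline; any pipeline that includes the mwt
    # processor reproduces the issue (processor list is an assumption).
    nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma,depparse")
    doc = nlp("The schoolmaster's wife started a sewing class.")
    for token in doc.sentences[0].tokens:
        # token.text is the MWT surface form; token.words are the split words
        print(token.text, [word.text for word in token.words])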

Expected behavior The non-MWT part of "schoolmaster's" resolves to the tokens schoolmaster / 's


Additional context I believe we found more errors like this; I will report them as I come across them.

AngledLuffa commented 6 months ago

Ultimately we need a major upgrade to the tokenizer, including the MWT splitting mechanism. There is a copy mechanism which tries to use the original text, but clearly it misses some cases. If you collect a few of these weird outliers, we'll rebuild the training data using them, and it should improve the results.


khannan-livefront commented 6 months ago

@AngledLuffa Yes, it appears the splitting mechanism is actually quite broken. We run Stanza against gigantic amounts of text, and we had no problem retaining the correct text values with versions of Stanza before the introduction of MWT tokens. In our latest runs we encountered thousands of errors involving them. Here is a sample of sentences where the tokens resolve with the wrong capitalization, or where the text of the original word is munged (sometimes inexplicably), for words with an apostrophe / contraction / possessive:

In God’s name let him be so.
Didn’t I say so?
They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
Pulcheria Alexandrovna's illness was a strange nervous one.
Couldn't he stop and retract it all?
She'll be my nurse.
It was quite an accident Lebeziatnikov's turning up.
Wasn't I right in saying that we were birds of a feather?
Couldn't he come?
She'll get it at the shop, my dear.
Who married the Marquis of Saint-Méran's daughter?
But to Dantès' eye there was no darkness.
Couldn’t he have waited for the good season to increase his chances?
She'd never noticed if it hadn't been for Sid.
Wasn't that a happy thought?

E.g., digging into the types of failures being seen, showing the token split for each text value (MWT / word / word):

In God’s name let him be so. -> God's / god / 's (loss of capitalization)

Didn't I say so? -> Didn't / did / not (loss of capitalization)

They began to be frightened at last at Pulcheria Alexandrovna's strange silence. -> Alexandrovna's / Alexandronan / 's (flat out wrong)

Couldn't he stop and retract it all? -> Couldn't / could / n't (loss of capitalization)

It was quite an accident Lebeziatnikov's turning up. -> Lebeziatnikov's / Lebeziatikovv / 's (flat out wrong)

Who married the Marquis of Saint-Méran's daughter? -> Méran's / Mran / 's (clear bug, seeing the insertion of a <UNK> tag)

But to Dantès' eye there was no darkness. -> Dantès' / Dants / ' (clear bug here, and seeing the insertion of a <UNK> tag)


Sadly, I think we might have to revert to a version of Stanza without MWT-based tokens, as they don't appear to be stable enough for our purposes to rely on. :(

AngledLuffa commented 6 months ago

This is probably fixable in one of a couple different ways. The English datasets are generally built so that the MWT are exactly composed of the subwords, so there's no reason to do anything other than split the original text. Alternatively, there may be some generalization of that to other languages where the model first checks if it's supposed to use an exact split of the words, and then it uses the seq2seq model only for words that aren't an exact split.
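For illustration, here's a rough sketch (a hypothetical script, not part of Stanza) of the kind of check that tells you whether a CoNLL-U dataset has that exact-composition property:

    # Hypothetical check: report any MWT in a CoNLL-U file whose surface
    # form is not the exact concatenation of its subwords.
    def find_inexact_mwts(conllu_path):
        with open(conllu_path, encoding="utf-8") as fin:
            lines = [line.rstrip("\n") for line in fin]
        mismatches = []
        for i, line in enumerate(lines):
            if not line or line.startswith("#"):
                continue
            token_id = line.split("\t")[0]
            if "-" not in token_id:
                continue  # not an MWT range line like "2-3"
            start, end = map(int, token_id.split("-"))
            surface = line.split("\t")[1]
            # the subword lines immediately follow the range line
            pieces = [lines[i + k].split("\t")[1] for k in range(1, end - start + 2)]
            if "".join(pieces) != surface:
                mismatches.append((surface, pieces))
        return mismatches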

khannan-livefront commented 6 months ago

Yeah, that would be very helpful. The apostrophe splitting used to be very stable for the English model before the addition of MWT tokens.

khannan-livefront commented 6 months ago

@AngledLuffa I should note that this bug occurs with the constituency parse as well. We get schoolmaterr and 's as the leaf nodes for schoolmaster's

AngledLuffa commented 6 months ago

Thought about it some. Realized I was probably overthinking it, and the easiest thing to do is to keep using the current model and have it replace the predicted characters with the original text whenever the predicted lengths add up to the length of the original text. Only for languages where this property holds, of course.

@qipeng
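
Roughly, the idea is this (a sketch with hypothetical names, not the actual trainer code):

    # Sketch of the length-check copy mechanism described above.
    def copy_back_surface(token_text, predicted_words):
        # If the predicted pieces account for every character of the
        # original token, slice the original surface text so the output
        # can never re-case or hallucinate characters.
        if sum(len(word) for word in predicted_words) != len(token_text):
            return predicted_words  # lengths don't add up: keep the prediction
        pieces, start = [], 0
        for word in predicted_words:
            pieces.append(token_text[start:start + len(word)])
            start += len(word)
        return pieces

E.g. copy_back_surface("Alexandrovna's", ["Alexandronan", "'s"]) comes back as ["Alexandrovna", "'s"], repairing the hallucination above.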

AngledLuffa commented 6 months ago

The english_mwt branch should now have a fix for most of these issues, although I wouldn't be surprised if the new model still occasionally hallucinates text which isn't the right length (in which case the new splitting mechanism won't work). Please LMK if this helps, and feel free to report whichever words still aren't split correctly.

The constituency parser issue is just because the parser as part of the pipeline is using the MWT as input, so it's getting the weird splits just like the other models.

AngledLuffa commented 6 months ago

I should note that although this helps with the OOV characters, there is another issue with words at the start of a sentence being split with the dictionary lookup and then lowercased... I don't think that's correct behavior of the model, and I can probably fix that a lot faster than it took to fix this.

AngledLuffa commented 6 months ago

I think the use of lowercasing in the MWT is just a straight up logic bug:

https://github.com/stanfordnlp/stanza/blob/c2d72bd14cf8cc28bd4e41a620692bbce5f43835/stanza/models/mwt/trainer.py#L99

https://github.com/stanfordnlp/stanza/blob/c2d72bd14cf8cc28bd4e41a620692bbce5f43835/stanza/models/mwt/trainer.py#L112

My own expectation with tokenization is that it doesn't change the characters used unnecessarily, but consider:

>>> [x.text for x in pipe("JENNIFER HAS NICE ANTENNAE").sentences[0].words]
['JENNIFER', 'HAS', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("JENNIFER'S GOT NICE ANTENNAE").sentences[0].words]
['JENNIFER', "'S", 'GOT', 'NICE', 'ANTENNAE']
>>> [x.text for x in pipe("SHE'S GOT NICE ANTENNAE").sentences[0].words]   # oops, this shows up in the dictionary
['she', "'s", 'GOT', 'NICE', 'ANTENNAE']

Maybe a reasonable fix would be to only look up all-lowercase, leading-uppercase, and all-uppercase forms, and have weird mixed cases go through the seq2seq model.
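
Something along these lines, a sketch assuming the dictionary stores lowercase expansions (names here are hypothetical, not the actual Stanza internals):

    # Sketch of a casing-aware dictionary lookup: only restore casings we
    # can reconstruct safely; anything else defers to the seq2seq model.
    def dictionary_split(token_text, mwt_dict):
        lowered = token_text.lower()
        if lowered not in mwt_dict:
            return None  # unknown token: defer to the seq2seq model
        expansion = mwt_dict[lowered]  # stored lowercase, e.g. ("she", "'s")
        if token_text == lowered:                    # all lowercase: "she's"
            return list(expansion)
        if token_text == token_text.upper():         # all uppercase: "SHE'S"
            return [word.upper() for word in expansion]
        if token_text[1:] == lowered[1:]:            # leading uppercase: "She's"
            first = expansion[0]
            return [first[:1].upper() + first[1:]] + list(expansion[1:])
        return None  # weird mixed case: defer to the seq2seq model

With a lookup like that, the SHE'S example above would come back as ['SHE', "'S"] instead of being lowercased.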

AngledLuffa commented 6 months ago

With these changes, here's what I get in the linked branch (which I'll merge into dev once I start hearing back from stakeholders in this issue) for the text you mentioned above:

>>> text = """
... In God’s name let him be so.
... Didn’t I say so?
... They began to be frightened at last at Pulcheria Alexandrovna's strange silence.
... Pulcheria Alexandrovna's illness was a strange nervous one.
... Couldn't he stop and retract it all?
... She'll be my nurse.
... It was quite an accident Lebeziatnikov's turning up.
... Wasn't I right in saying that we were birds of a feather?
... Couldn't he come?
... She'll get it at the shop, my dear.
... Who married the Marquis of Saint-Méran's daughter?
... But to Dantès' eye there was no darkness.
... Couldn’t he have waited for the good season to increase his chances?
... She'd never noticed if it hadn't been for Sid.
... Wasn't that a happy thought?
... """
>>> text = text.strip().split("\n")
>>> for line in text:
...   print([x.text for x in pipe(line).sentences[0].words])

['In', 'God', '’s', 'name', 'let', 'him', 'be', 'so', '.']
['Did', 'n’t', 'I', 'say', 'so', '?']
['They', 'began', 'to', 'be', 'frightened', 'at', 'last', 'at', 'Pulcheria', 'Alexandrovna', "'s", 'strange', 'silence', '.']
['Pulcheria', 'Alexandrovna', "'s", 'illness', 'was', 'a', 'strange', 'nervous', 'one', '.']
['Could', "n't", 'he', 'stop', 'and', 'retract', 'it', 'all', '?']
['She', "'ll", 'be', 'my', 'nurse', '.']
['It', 'was', 'quite', 'an', 'accident', 'Lebeziatnikov', "'s", 'turning', 'up', '.']
['Was', "n't", 'I', 'right', 'in', 'saying', 'that', 'we', 'were', 'birds', 'of', 'a', 'feather', '?']
['Could', "n't", 'he', 'come', '?']
['She', "'ll", 'get', 'it', 'at', 'the', 'shop', ',', 'my', 'dear', '.']
['Who', 'married', 'the', 'Marquis', 'of', 'Saint', '-', 'Méran', "'s", 'daughter', '?']
['But', 'to', 'Dantès', "'", 'eye', 'there', 'was', 'no', 'darkness', '.']
['Could', 'n’t', 'he', 'have', 'waited', 'for', 'the', 'good', 'season', 'to', 'increase', 'his', 'chances', '?']
['She', "'d", 'never', 'noticed', 'if', 'it', 'had', "n't", 'been', 'for', 'Sid', '.']
['Was', "n't", 'that', 'a', 'happy', 'thought', '?']

AngledLuffa commented 6 months ago

This looks much better to me, but I will say there's an annoying 99.98% accuracy on the test set, whereas I would have expected 100% now that the casing is fixed and the model is forced to copy the input whenever possible. Hopefully it's just a random word which doesn't show up in the training data and isn't being correctly split, rather than one of the hallucinations or re-casings we're trying to fix

AngledLuffa commented 6 months ago

Eh, well, the non-100% appears to be entirely typos which were annotated in the test set and not handled in the expected manner by the model: situations where the tokenizer isn't actually splitting an MWT, but if you force the MWT processor to make a decision, it doesn't really know how to process "Mens room" instead of "Men's room", etc.

11944,11945c11943,11944
< 9     Cox
< 10    '
---
> 9     Co
> 10    x'
15578,15579c15577,15578
< 16    sheep
< 17    s
---
> 16    shee
> 17    ps
26469,26470c26467,26468
< 1     Men
< 2     s
---
> 1     Me
> 2     ns
khannan-livefront commented 6 months ago

Thanks for looking into this @AngledLuffa!! I will update my stanza to the latest on the dev branch to pick up these changes.

AngledLuffa commented 6 months ago

Just to be clear, it's not merged yet - but probably soon, since it is passing the current unit tests and is a lot better on the test cases you gave us. Especially if you report back saying you like this branch more than the current release :)

khannan-livefront commented 5 months ago

Just tested out the latest changes, no problems on my end!

AngledLuffa commented 5 months ago

This is now part of the 1.8.2 release