stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] Wrong lemmatization in some cases #542

Open yuhengwang1 opened 3 years ago

yuhengwang1 commented 3 years ago

Hi,

I ran into a problem with lemmatization.

When lemmatizing "Mary fries some French fries", the verb "fries" was wrongly lemmatized as "fries", which is the lemma of the noun "fries". Meanwhile, the POS of the first "fries" was correctly recognized as "VBZ".

Given that the lemmatizer depends on POS tags and is dictionary-based, I was wondering why it failed on this sentence. Besides fine-tuning the model, what else could we do to fix such cases?

Thanks!

AngledLuffa commented 3 years ago

The lemma annotator is not entirely dictionary-based. In fact, a large chunk of the work is done by a seq2seq model. Both pieces are learned from the training data, so any limitation in the training data will be reflected in the model itself.

The limitation here is that "fries" does not show up in the training data as a verb anywhere. Furthermore, there seems to be a data bug (which I will file on the EWT github repo). To me, I would say that the lemma should ALWAYS be "fry", seeing as how you can have a single French fry or a single crab fry. I can see an argument that a "large fries" is a collection of French fries and can be treated as a new word with the lemma "fries"... However, some of the various sentences which have "fries" have the lemma as "fries" and others have it as "fry", with no real consistency between the usages. It's not surprising that the model would learn to be inconsistent / wrong when the source data is itself inconsistent.

As for teaching it the verb usage, that would require either nagging the EWT people to add some sentences specifically with "fry" as a verb, or mixing data sources on our side. The latter fix is on our radar, but probably not for the upcoming version we hope to release by the end of the year.

Please leave this bug open, even if this answers your question, as sooner or later we will include data with "fry" the verb and that will hopefully fix your problem.

sent_id = answers-20111019100027AAdxgXV_ans-0008

text = Because Large Fries give you FOUR PIECES!

3 Fries fries NOUN NNS Number=Plur 4 nsubj 4:nsubj _

sent_id = answers-20111019100027AAdxgXV_ans-0019

text = - Large fries

3 fries fries NOUN NNS Number=Plur 0 root 0:root _

sent_id = answers-20111031103114AA61BW3_ans-0014

newpar id = answers-20111031103114AA61BW3_ans-p0004

text = If you're looking for a good burger, some great fries (they are too die for!), and good drinks, go to Chickie & Pete's!

12 fries fries NOUN NNS Number=Plur 8 conj 4:obl:for|8:conj:and _

sent_id = answers-20111031103114AA61BW3_ans-0016

text = If you go, make sure you order the crab fries, you won't regret it :)

11 fries fries NOUN NNS Number=Plur 8 obj 8:obj SpaceAfter=No

sent_id = reviews-154658-0004

text = (other items: chicken fingers, wings, asian pizza, and yam and regular fries)

17 fries fry NOUN NNS Number=Plur 6 conj 3:appos|6:conj:and SpaceAfter=No

sent_id = reviews-196219-0003

text = SO, IF YOU WANT A BURGER AND FRIES, WELL, IT IS OK.

9 FRIES fry NOUN NNS Number=Plur 7 conj 5:obj|7:conj:and SpaceAfter=No

sent_id = reviews-068436-0006

text = Disgusting french fries is very best menu.

3 fries fry NOUN NNS Number=Plur 7 nsubj 7:nsubj _
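
The inconsistency in the excerpts above can be counted mechanically. The following is a minimal sketch, not part of Stanza or the EWT tooling: the helper names `lemma_counts` and `inconsistent` are made up for illustration, and the rows are split on whitespace because that is how they appear in this thread (real CoNLL-U files are tab-separated).

```python
from collections import Counter, defaultdict

# Token rows copied from the EWT excerpts quoted in this thread.
# Assumption: whitespace-separated here; real CoNLL-U uses tabs.
TOKEN_LINES = """\
3 Fries fries NOUN NNS Number=Plur 4 nsubj 4:nsubj _
3 fries fries NOUN NNS Number=Plur 0 root 0:root _
12 fries fries NOUN NNS Number=Plur 8 conj 4:obl:for|8:conj:and _
11 fries fries NOUN NNS Number=Plur 8 obj 8:obj SpaceAfter=No
17 fries fry NOUN NNS Number=Plur 6 conj 3:appos|6:conj:and SpaceAfter=No
9 FRIES fry NOUN NNS Number=Plur 7 conj 5:obj|7:conj:and SpaceAfter=No
3 fries fry NOUN NNS Number=Plur 7 nsubj 7:nsubj _
""".splitlines()


def lemma_counts(lines):
    """Map (lowercased form, UPOS) to a Counter of the lemmas assigned to it."""
    counts = defaultdict(Counter)
    for line in lines:
        cols = line.split()
        if len(cols) != 10:  # skip comments, blank lines, malformed rows
            continue
        form, lemma, upos = cols[1], cols[2], cols[3]
        counts[(form.lower(), upos)][lemma] += 1
    return counts


def inconsistent(counts):
    """Keep only the entries that were annotated with more than one lemma."""
    return {key: dict(c) for key, c in counts.items() if len(c) > 1}


report = inconsistent(lemma_counts(TOKEN_LINES))
print(report)
```

Run over a whole treebank, a report like this would list every form whose lemma annotation is split, which is the kind of evidence worth attaching to a data bug report.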

yuhengwang1 commented 3 years ago

Thanks for the quick response!

yuhengwang1 commented 3 years ago

Also, in cases like "John buys a car" and "He always buys gifts for his mother", the model lemmatizes "buys" as "busy". Additionally, "tomatoes" and "potatoes" are also wrongly lemmatized. Hope they can be fixed in the future since these words are pretty common. Thanks!

AngledLuffa commented 3 years ago

"potatos" and "tomatos" are misspelled in the original data. It's a collection of natural text from the web, so it is not terribly surprising that typos like that show up, but it does tend to screw up models learned from this kind of data. At least our model is just learning to be a dumbass instead of learning to be a racist Nazi.

As for "busy"/"buys", this is the only example of "buys" in the data. Once again, some typo is "teaching" our model to be special.

sent_id = reviews-010433-0006

text = Only one server, too buys talking with others I guess.

1 Only only ADV RB 3 advmod 3:advmod
2 one one NUM CD NumType=Card 3 nummod 3:nummod
3 server server NOUN NN Number=Sing 0 root 0:root SpaceAfter=No
4 , , PUNCT , 3 punct 3:punct
5 too too ADV RB 6 advmod 6:advmod
6 buys busy ADJ JJ Degree=Pos|Typo=Yes 11 ccomp 11:ccomp
7 talking talk VERB VBG VerbForm=Ger 6 advcl 6:advcl
8 with with ADP IN 9 case 9:case
9 others other NOUN NNS Number=Plur 7 obl 7:obl:with
10 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 11 nsubj 11:nsubj
11 guess guess VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 3 parataxis 3:parataxis SpaceAfter=No
12 . . PUNCT . 3 punct 3:punct _
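
Rows like token 6 above carry a `Typo=Yes` feature, so they can be listed before training to see which surface forms are mapped to a corrected lemma. This is only a sketch: `typo_rows` is a hypothetical helper, and the sample rows are written as ten tab-separated columns with `_` assumed for the columns that are missing in the pasted excerpt.

```python
def typo_rows(conllu_text):
    """Yield (form, lemma) for tokens whose FEATS column contains Typo=Yes."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or comment line
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token row
        form, lemma, feats = cols[1], cols[2], cols[5]
        if "Typo=Yes" in feats.split("|"):
            yield form, lemma


# Three rows from the "too buys talking" sentence.
# Assumption: the "_" placeholders are filled in by hand for illustration.
SAMPLE = "\n".join([
    "5\ttoo\ttoo\tADV\tRB\t_\t6\tadvmod\t6:advmod\t_",
    "6\tbuys\tbusy\tADJ\tJJ\tDegree=Pos|Typo=Yes\t11\tccomp\t11:ccomp\t_",
    "7\ttalking\ttalk\tVERB\tVBG\tVerbForm=Ger\t6\tadvcl\t6:advcl\t_",
])

flagged = list(typo_rows(SAMPLE))
print(flagged)
```

A form that only ever appears in such a row, as "buys" does here, is a likely source of a wrong lemma for the correctly spelled word.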


AngledLuffa commented 1 year ago

Using some updated datasets:

John buys a car   -> buy
Seriously, who buys black licorice in bulk?   -> buy
He is busy working on a few weird lemma errors   -> busy
French fries are made of potatoes   -> fry, potato
French fries are made of tomatoes   -> fry, tomato
He fries chicken every night   -> fry
I ate some fries from Popeye's   -> fry

Weirdly, though, in that last sentence "Popeye's" is not split into "Popeye" and "'s". I tried a few other possessives, and the split works at the end of a sentence in some cases but not in others:

The biggest figure skates in the house are my wife's    ->  wife   's
My wife's figure skates are bigger than the kids'   ->  wife  's,   kids   '
Of all the antennae in Star Trek, the ones I like most are Jennifer's   ->   Jennifer's
Jennifer's antennae are the best looking antennae in Star Trek   ->   Jennifer   's
The cell hidden in the toilet is Walt's   ->   Walt's
Walt's cell phone is hidden in the toilet   ->   Walt   's
These screwed up looking x-rays are my hip's   ->   hip's
My hip's x-rays look pretty screwed up   ->   hip   's

as a bonus, x-rays IS tokenized, which is the standard in EWT and GUM but I hate it

So while the lemma problem that originally prompted this issue is now fixed, there is apparently a tokenization error that should be addressed as well.

Weird update: even adding sentence-final punctuation doesn't help, contrary to what I expected:

Of all the antennae in Star Trek, the ones I like most are Jennifer's.   ->   Jennifer's
These screwed up looking x-rays are my hip's.   ->   hip's
The cell hidden in the toilet is Walt's.   ->   Walt's

AngledLuffa commented 1 year ago

I came up with a few sentences of my own, and grepped through Wikipedia for a few short sentences, and in each of the following cases the tokenizer gets it wrong:

This cat is Gaurav's.
I will chain my bike to Logan's.
These skates are Billy's.
These screwed up x-rays are my hip's.
The bigger bike is Logan's, the smaller bike is Tommy's.
Cal's football team is better than Stanford's.
Deanna is spending the night at Riker's.
In "On the Soul" , Aristotle famously criticizes Plato's theory of the soul and develops his own in response to Plato's.
His skills as a physician surpassed that of Apollo's.
God accepted his offering but not his brother's.
Lee's army, thinned by desertion and casualties, was now much smaller than Grant's.
During the 25th Dynasty, Pharaoh Taharqa created an empire nearly as large as the New Kingdom's.
They first accomplished this by replicating an experiment of Henry Cavendish's.
Jezebel's death, however, was more dramatic than Ahab's.
After Abu Bakr's family arrived in Medina, he bought another house near Muhammad's.
He earned the astronauts' respect and admiration, including Schirra's.
Its metallicity is about 30% lower than the Sun's.

I suspect this is because the training data contains quite a few sentence-final instances where something like "1950's" is treated as a single token.
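
A quick way to turn the sentences above into a regression check is to scan the tokenizer output for tokens that still have the possessive attached. The sketch below works on a plain list of token strings and does not call Stanza; `fused_possessives` and the token lists are illustrative assumptions. Decade-style tokens such as "1950's" are deliberately not flagged, since they start with digits.

```python
import re

# Letters followed by 's or s' in a single token, e.g. "Jennifer's" or "kids'".
# Tokens that contain digits, such as "1950's", do not match.
# Note: contractions like "it's" also match; they would normally be split too.
FUSED_POSSESSIVE = re.compile(r"^[^\W\d_]+(?:'s|s')$", re.IGNORECASE)


def fused_possessives(tokens):
    """Return the tokens that look like a word with an unsplit possessive."""
    return [tok for tok in tokens if FUSED_POSSESSIVE.match(tok)]


# Hypothetical token lists mirroring the outputs reported in this thread.
bad = ["the", "ones", "I", "like", "most", "are", "Jennifer's", "."]
good = ["Jennifer", "'s", "antennae", "are", "the", "best"]

print(fused_possessives(bad))
print(fused_possessives(good))
```

Feeding the token texts of each test sentence through a check like this gives a pass/fail signal without having to read the output by eye.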

The constituency parser comes up with the following, most of which look correct:

(ROOT
  (S
    (NP (DT This) (NN cat))
    (VP
      (VBZ is)
      (NP (NNP Gaurav) (POS 's)))
    (. .)))

(ROOT
  (S
    (NP (PRP I))
    (VP
      (MD will)
      (VP
        (VB chain)
        (NP (PRP$ my) (NN bike))
        (PP
          (IN to)
          (NP (NNP Logan) (POS 's)))))
    (. .)))

(ROOT
  (S
    (NP (DT These) (NNS skates))
    (VP
      (VBP are)
      (NP (NNP Billy) (POS 's)))
    (. .)))

(ROOT
  (S
    (NP
      (DT These)
      (ADJP (VBN screwed) (RP up))
      (NN x)
      (HYPH -)
      (NNS rays))
    (VP
      (VBP are)
      (NP (PRP$ my) (NN hip) (POS 's)))
    (. .)))

(ROOT
  (S
    (S
      (NP (DT The) (JJR bigger) (NN bike))
      (VP
        (VBZ is)
        (NP (NNP Logan's))))
    (, ,)
    (S
      (NP (DT the) (JJR smaller) (NN bike))
      (VP
        (VBZ is)
        (NP (NNP Tommy) (POS 's))))
    (. .)))

(ROOT
  (S
    (NP
      (NP (NNP Cal) (POS 's))
      (NN football)
      (NN team))
    (VP
      (VBZ is)
      (ADJP
        (ADJP (JJR better))
        (PP
          (IN than)
          (NP (NNP Stanford) (POS 's)))))
    (. .)))

(ROOT
  (S
    (NP (NNP Deanna))
    (VP
      (VBZ is)
      (VP
        (VBG spending)
        (NP (DT the) (NN night))
        (PP
          (IN at)
          (NP (NNP Riker) (POS 's)))))
    (. .)))

(ROOT
  (S
    (PP
      (IN In)
      (NP
        (`` ")
        (PP
          (IN On)
          (NP (DT the) (NNP Soul)))
        ('' ")))
    (, ,)
    (NP (NNP Aristotle))
    (ADVP (RB famously))
    (VP
      (VP
        (VBZ criticizes)
        (NP
          (NP
            (NP (NNP Plato) (POS 's))
            (NN theory))
          (PP
            (IN of)
            (NP (DT the) (NN soul)))))
      (CC and)
      (VP
        (VBZ develops)
        (NP (PRP$ his) (JJ own))
        (PP
          (IN in)
          (NP
            (NP (NN response))
            (PP
              (IN to)
              (NP (NNP Plato) (POS 's)))))))
    (. .)))

(ROOT
  (S
    (NP
      (NP (PRP$ His) (NNS skills))
      (PP
        (IN as)
        (NP (DT a) (NN physician))))
    (VP
      (VBD surpassed)
      (NP
        (NP (DT that))
        (PP
          (IN of)
          (NP (NNP Apollo) (POS 's)))))
    (. .)))

(ROOT
  (S
    (NP (NNP God))
    (VP
      (VBD accepted)
      (NP
        (NP (PRP$ his) (NN offering))
        (CC but)
        (RB not)
        (NP (PRP$ his) (NN brother) (POS 's))))
    (. .)))

(ROOT
  (S
    (NP
      (NP
        (NP (NNP Lee) (POS 's))
        (NN army))
      (, ,)
      (VP
        (VBN thinned)
        (PP
          (IN by)
          (NP (NN desertion) (CC and) (NNS casualties))))
      (, ,))
    (VP
      (VBD was)
      (ADVP (RB now))
      (ADJP
        (ADJP (RB much) (JJR smaller))
        (PP
          (IN than)
          (NP (NNP Grant) (POS 's)))))
    (. .)))

(ROOT
  (S
    (PP
      (IN During)
      (NP (DT the) (JJ 25th) (NNP Dynasty)))
    (, ,)
    (NP (NNP Pharaoh) (NNP Taharqa))
    (VP
      (VBD created)
      (NP
        (NP (DT an) (NN empire))
        (ADJP
          (ADJP (RB nearly) (RB as) (JJ large))
          (PP
            (IN as)
            (NP (DT the) (NNP New) (NNP Kingdom) (POS 's))))))
    (. .)))

(ROOT
  (S
    (NP (PRP They))
    (ADVP (RB first))
    (VP
      (VBD accomplished)
      (NP (DT this))
      (PP
        (IN by)
        (S
          (VP
            (VBG replicating)
            (NP
              (NP (DT an) (NN experiment))
              (PP
                (IN of)
                (NP (NNP Henry) (NNP Cavendish) (POS 's))))))))
    (. .)))

(ROOT
  (S
    (NP
      (NP (NNP Jezebel) (POS 's))
      (NN death))
    (, ,)
    (ADVP (RB however))
    (, ,)
    (VP
      (VBD was)
      (ADJP
        (ADJP (RBR more) (JJ dramatic))
        (PP
          (IN than)
          (NP (NNP Ahab) (POS 's)))))
    (. .)))

(ROOT
  (S
    (SBAR
      (IN After)
      (S
        (NP
          (NP (NNP Abu) (NNP Bakr) (POS 's))
          (NN family))
        (VP
          (VBD arrived)
          (PP
            (IN in)
            (NP (NNP Medina))))))
    (, ,)
    (NP (PRP he))
    (VP
      (VBD bought)
      (NP (DT another) (NN house))
      (PP
        (IN near)
        (NP (NNP Muhammad) (POS 's))))
    (. .)))

(ROOT
  (S
    (NP (PRP He))
    (VP
      (VBD earned)
      (NP
        (NP
          (NP (DT the) (NNS astronauts) (POS '))
          (NN respect)
          (CC and)
          (NN admiration))
        (, ,)
        (PP
          (VBG including)
          (NP (NNP Schirra) (POS 's)))))
    (. .)))

(ROOT
  (S
    (NP (PRP$ Its) (NN metallicity))
    (VP
      (VBZ is)
      (ADJP
        (ADJP
          (NP
            (QP (RB about) (CD 30))
            (NN %))
          (JJR lower))
        (PP
          (IN than)
          (NP (DT the) (NNP Sun) (POS 's)))))
    (. .)))

The only thing I really question is whether x-rays should have its own constituent or not. Maybe not. I'll look for some other sentences from Wikipedia which might be trickier for the parser, so we can fix both the tokenization and a weird parse in one go.
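
The one tree above that still contains an unsplit token, `(NNP Logan's)`, can also be found mechanically. Below is a rough sketch of a bracketed-tree scan using only the standard library; it is not the parser's own tree reader, and `preterminals` and `unsplit_possessives` are hypothetical helper names.

```python
import re

# A single bracket, or any run of characters that is neither space nor bracket.
TOKEN = re.compile(r"[()]|[^\s()]+")


def preterminals(tree_text):
    """Return (tag, word) pairs for every leaf of a bracketed tree."""
    toks = TOKEN.findall(tree_text)
    pairs = []
    for i in range(len(toks) - 3):
        left, tag, word, right = toks[i:i + 4]
        # A leaf has the shape "( TAG word )" with no nested bracket inside.
        if left == "(" and right == ")" and not {tag, word} & {"(", ")"}:
            pairs.append((tag, word))
    return pairs


def unsplit_possessives(tree_text):
    """Leaves that end in 's but are not tagged as a separate POS token."""
    return [word for tag, word in preterminals(tree_text)
            if tag != "POS" and len(word) > 2 and word.lower().endswith("'s")]


# Fragment of the "bigger bike" tree shown earlier in this comment.
TREE = """
(S
  (S
    (NP (DT The) (JJR bigger) (NN bike))
    (VP
      (VBZ is)
      (NP (NNP Logan's))))
  (, ,)
  (S
    (NP (DT the) (JJR smaller) (NN bike))
    (VP
      (VBZ is)
      (NP (NNP Tommy) (POS 's)))))
"""

print(unsplit_possessives(TREE))
```

Only the first clause is flagged, matching the tree as printed: "Logan's" before the comma stays fused while "Tommy 's" before the final period is split.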

Here are a few sentences which might qualify, along with their parses:

Returning to Assisi, he traversed the city begging stones for the restoration of St. Damiano's.
Preserving the orthodoxy of the relationship between God and mathematics, although not in the same form as held by his critics, was long a concern of Cantor's.
It is to Bernini that is due the lion's share of responsibility for the final and enduring aesthetic appearance and emotional impact of St. Peter's.
No details survive about Chaumpaigne's service or how she came to leave Staundon 's employ for Chaucer's.
Kasparov's attacking style of play has been compared by many to Alekhine's.
These screwed up x-rays are my hip's.

# not sure this is the right way to parse a phrase such as Returning, but maybe
# some early sentences in PTB with a similar structure are "Keeping the mood light, the two then ..."
# and "Stuffing a wad of Red Man into his cheek, he admits ..."
# searching for
# /.*ing/ !, __
(ROOT
  (S
    (S
      (VP
        (VBG Returning)
        (PP
          (IN to)
          (NP (NNP Assisi)))))
    (, ,)
    (NP (PRP he))
    (VP
      (VBD traversed)
      (NP (DT the) (NN city))
      (S
        (VP
          (VBG begging)
          (NP
            (NP (NNS stones))
            (PP
              (IN for)
              (NP
                (NP (DT the) (NN restoration))
                (PP
                  (IN of)
                  (NP (NNP St.) (NNP Damiano) (POS 's)))))))))
    (. .)))

# here though, the S - VP structure for the noun form of the verb is a little weird to me
# Preserving ... is used as noun, despite extra phrase between "Preserving" and "was long"
# consider "Ballooning at ... 6 am held all the attraction ..."
# but then, since it's being used as a verb acting on a noun, it might be correct after all
# consider "Defining combat aircraft is even tougher"
# or "Finding him became ..."
(ROOT
  (S
    (S
      (VP
        (VBG Preserving)
        (NP
          (NP (DT the) (NN orthodoxy))
          (PP
            (IN of)
            (NP
              (NP (DT the) (NN relationship))
              (PP
                (IN between)
                (NP (NNP God) (CC and) (NN mathematics))))))
        (, ,)
        (SBAR
          (IN although)
          (RB not)
          (PP
            (IN in)
            (NP
              (NP (DT the) (JJ same) (NN form))
              (SBAR
                (IN as)
                (S
                  (VP
                    (VBN held)
                    (PP
                      (IN by)
                      (NP (PRP$ his) (NNS critics)))))))))
        (, ,)))
    (VP
      (VBD was)
      (ADVP (RB long))
      (NP
        (NP (DT a) (NN concern))
        (PP
          (IN of)
          (NP (NNP Cantor) (POS 's)))))
    (. .)))

# the structure under the SBAR looks weird.  need to look up PTB trees which might resemble this
# "is due ..." does not happen in this context,
# but "is owed $$$" happens
# and the structure is (VP (VBZ is) (VP (VBN owed) (NP stuff)))
# as opposed to this parse, which has the (NP stuff) up a level higher
(ROOT
  (S
    (NP (PRP It))
    (VP
      (VBZ is)
      (PP
        (IN to)
        (NP (NNP Bernini)))
      (SBAR
        (WHNP (WDT that))
        (S
          (VP
            (VBZ is)
            (VP (JJ due))
            (NP
              (NP
                (NP (DT the) (NN lion) (POS 's))
                (NN share))
              (PP
                (IN of)
                (NP
                  (NP (NN responsibility))
                  (PP
                    (IN for)
                    (NP
                      (NP
                        (DT the)
                        (ADJP (JJ final) (CC and) (JJ enduring))
                        (NML
                          (NML (JJ aesthetic) (NN appearance))
                          (CC and)
                          (NML (JJ emotional) (NN impact))))
                      (PP
                        (IN of)
                        (NP (NNP St.) (NNP Peter) (POS 's))))))))))))
    (. .)))

# I wonder if the outer PP around the PP and SBAR is correct
(ROOT
  (S
    (S
      (NP (DT The) (NN case))
      (VP
        (VBD was)
        (ADVP (RB never))
        (VP (VBN prosecuted))))
    (CC and)
    (S
      (NP
        (NP (DT no) (NNS details)))
      (VP
        (VBP survive)
        (PP
          (PP
            (IN about)
            (NP
              (NP (NNP Chaumpaigne) (POS 's))
              (NN service)))
          (CC or)
          (SBAR
            (WHADVP (WRB how))
            (S
              (NP (PRP she))
              (VP
                (VBD came)
                (S
                  (VP
                    (TO to)
                    (VP
                      (VB leave)
                      (NP
                        (NP (NNP Staundon) (POS 's))
                        (NN employ))
                      (PP
                        (IN for)
                        (NP (NNP Chaucer) (POS 's))))))))))))
    (. .)))

# should "of play" be connected to "style" instead of at a level higher up?
(ROOT
  (S
    (NP
      (NP
        (NP (NNP Kasparov) (POS 's))
        (VBG attacking)
        (NN style))
      (PP
        (IN of)
        (NP (NN play))))
    (VP
      (VBZ has)
      (VP
        (VBN been)
        (VP
          (VBN compared)
          (PP
            (IN by)
            (NP (JJ many)))
          (PP
            (IN to)
            (NP (NNP Alekhine) (POS 's))))))
    (. .)))

# wondering if x-rays should have its own node
(ROOT
  (S
    (NP
      (DT These)
      (ADJP (VBN screwed) (RP up))
      (NN x)
      (HYPH -)
      (NNS rays))
    (VP
      (VBP are)
      (NP (PRP$ my) (NN hip) (POS 's)))
    (. .)))