stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

POS tagging's unexpected result on ADJ word #730

Closed muchang closed 1 year ago

muchang commented 3 years ago

Describe the bug

For this sentence,

He also designed furniture and houses for the wealthy.

"wealthy" is an adjective that exceptionally heads a nominal phrase, so it should be tagged ADJ according to the Universal Dependencies guidelines:

On the other hand, adjectives that exceptionally head a nominal phrase (as in the sick, the healthy) are still tagged ADJ.

To Reproduce

$ python3 test.py 
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.2.1.json: 139kB [00:00, 688kB/s]                                                                            
2021-06-25 11:32:45 INFO: Downloading default packages for language: en (English)...
2021-06-25 11:32:54 INFO: Finished downloading models and saved to /Users/stanza_resources.
2021-06-25 11:32:54 INFO: Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
========================

2021-06-25 11:32:54 INFO: Use device: cpu
2021-06-25 11:32:54 INFO: Loading: tokenize
2021-06-25 11:32:54 INFO: Loading: pos
2021-06-25 11:32:56 INFO: Done loading processors!
He None PRON
also None ADV
designed None VERB
furniture None NOUN
and None CCONJ
houses None NOUN
for None ADP
the None DET
wealthy None NOUN
. None PUNCT

$ cat test.py
import stanza

stanza.download('en')
nlp = stanza.Pipeline('en',processors='tokenize,pos')
doc = nlp('He also designed furniture and houses for the wealthy.')

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Expected behavior

wealthy should be tagged as ADJ.

AngledLuffa commented 3 years ago

The primary problem is there are no examples of "wealthy" in such a context in the training data. However, there is an example of healthy/sick which is incorrectly tagged. I'll file an issue on the UD github.

# sent_id = newsgroup-groups.google.com_herpesnation_c74170a0fcfdc880_ENG_20051125_075200-0012
# text = When the healthy treat the sick with scorn and intolerance it brings us all down.
1       When    when    SCONJ   WRB     PronType=Int    4       mark    4:mark  _
2       the     the     DET     DT      Definite=Def|PronType=Art       3       det     3:det   _
3       healthy healthy ADJ     JJ      Degree=Pos      4       nsubj   4:nsubj _
4       treat   treat   VERB    VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        12      advcl   12:advcl:when   _
5       the     the     DET     DT      Definite=Def|PronType=Art       6       det     6:det   _
6       sick    sick    NOUN    NN      Number=Sing     4       obj     4:obj   _
7       with    with    ADP     IN      _       8       case    8:case  _
8       scorn   scorn   NOUN    NN      Number=Sing     4       obl     4:obl:with      _
9       and     and     CCONJ   CC      _       10      cc      10:cc   _
10      intolerance     intolerance     NOUN    NN      Number=Sing     8       conj    4:obl:with|8:conj:and   _
11      it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  12      nsubj   12:nsubj        _
12      brings  bring   VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    0:root  _
13      us      we      PRON    PRP     Case=Acc|Number=Plur|Person=1|PronType=Prs      12      obj     12:obj  _
14      all     all     DET     DT      _       13      det     13:det  _
15      down    down    ADV     RB      _       12      advmod  12:advmod       SpaceAfter=No
16      .       .       PUNCT   .       _       12      punct   12:punct        _
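
One way to spot this kind of inconsistency is to list the UPOS tag and dependency relation for selected word forms in a CoNLL-U file. The sketch below is only an illustration and is not part of Stanza or the UD tooling; the function name `find_form_tags` is hypothetical, and the inline sample is a trimmed copy (three tokens) of the sentence quoted above.

```python
def find_form_tags(conllu_text, forms):
    """Return (sent_id, form, upos, deprel) for each token whose form is in `forms`."""
    wanted = {f.lower() for f in forms}
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, deprel = cols[1], cols[3], cols[7]
        if form.lower() in wanted:
            hits.append((sent_id, form, upos, deprel))
    return hits


# Trimmed copy of the EWT sentence quoted above (only three tokens kept).
SENT_ID = "newsgroup-groups.google.com_herpesnation_c74170a0fcfdc880_ENG_20051125_075200-0012"
sample = "\n".join([
    "# sent_id = " + SENT_ID,
    "\t".join(["3", "healthy", "healthy", "ADJ", "JJ", "Degree=Pos", "4", "nsubj", "4:nsubj", "_"]),
    "\t".join(["4", "treat", "treat", "VERB", "VBP", "_", "12", "advcl", "12:advcl:when", "_"]),
    "\t".join(["6", "sick", "sick", "NOUN", "NN", "Number=Sing", "4", "obj", "4:obj", "_"]),
])

for hit in find_form_tags(sample, ["healthy", "sick"]):
    print(hit)
```

On the sample, "healthy" comes back as ADJ and "sick" as NOUN, which is the mismatch described above. Running the same function over each training file with "wealthy" as the form would show how that word is tagged elsewhere.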

muchang commented 3 years ago

Thanks, John.

BTW, although healthy/sick is incorrectly labelled in the training data, the POS tagger gives the correct result if I change "wealthy" to "sick" in this sentence:

$ python3 test.py 
2021-06-25 14:30:34 INFO: Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
========================

2021-06-25 14:30:34 INFO: Use device: cpu
2021-06-25 14:30:34 INFO: Loading: tokenize
2021-06-25 14:30:34 INFO: Loading: pos
2021-06-25 14:30:34 INFO: Done loading processors!
He None PRON
also None ADV
designed None VERB
furniture None NOUN
and None CCONJ
houses None NOUN
for None ADP
the None DET
sick None ADJ
. None PUNCT

$ cat test.py 
import stanza

nlp = stanza.Pipeline('en',processors='tokenize,pos')
doc = nlp('He also designed furniture and houses for the sick.')

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)

Which corpora is Stanza trained on besides UD_EWT? Perhaps healthy/sick is labelled correctly in the other corpora.

AngledLuffa commented 3 years ago

Yeah, good call. Here's the problem: in GUM, "wealthy" is tagged NOUN in this sentence:

# sent_id = GUM_voyage_merida-18
# s_type = decl
# text = The wealthy constructed the grand Pasejo Montejo avenue north of the old town, inspired by the Champs-Élysées in Paris.
1       The     the     DET     DT      Definite=Def|PronType=Art       2       det     2:det   Discourse=joint:25->24|Entity=(person-50
2       wealthy wealthy NOUN    NNS     Number=Plur     3       nsubj   3:nsubj Entity=person-50)
3       constructed     construct       VERB    VBD     Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin   0       root    0:root  _
4       the     the     DET     DT      Definite=Def|PronType=Art       8       det     8:det   Entity=(place-51-Paseo_de_Montejo
5       grand   grand   ADJ     JJ      Degree=Pos      8       amod    8:amod  _
6       Pasejo  Pasejo  PROPN   NNP     Number=Sing     8       compound        8:compound      _
7       Montejo Montejo PROPN   NNP     Number=Sing     6       flat    6:flat  Entity=(person-32-Francisco_de_Montejo_the_Younger)
8       avenue  avenue  NOUN    NN      Number=Sing     3       obj     3:obj   Entity=place-51-Paseo_de_Montejo)
9       north   north   ADV     RB      Degree=Pos      3       advmod  3:advmod        _
10      of      of      ADP     IN      _       13      case    13:case _
11      the     the     DET     DT      Definite=Def|PronType=Art       13      det     13:det  Bridge=place-1-Mérida%2C_Yucatán<place-52-Mérida%2C_Yucatán|Entity=(place-52-Mérida%2C_Yucatán
12      old     old     ADJ     JJ      Degree=Pos      13      amod    13:amod _
13      town    town    NOUN    NN      Number=Sing     9       nmod    9:nmod:of       Entity=place-52-Mérida%2C_Yucatán)|SpaceAfter=No
14      ,       ,       PUNCT   ,       _       15      punct   15:punct        _
15      inspired        inspire VERB    VBN     Tense=Past|VerbForm=Part        13      acl     13:acl  Discourse=elaboration:26->25
16      by      by      ADP     IN      _       18      case    18:case _
17      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  Entity=(place-53-Champs-Élysées
18      Champs-Élysées  Champs-Élysées  PROPN   NNP     Number=Sing     15      obl     15:obl:by       _
19      in      in      ADP     IN      _       20      case    20:case _
20      Paris   Paris   PROPN   NNP     Number=Sing     15      obl     15:obl:in       Entity=(place-54-Paris)place-53-Champs-Élysées)|SpaceAfter=No
21      .       .       PUNCT   .       _       3       punct   3:punct _

muchang commented 3 years ago

Cool, John!

AngledLuffa commented 3 years ago

Unfortunately, when I retrained the model with the updated data, it still didn't get the correct answer. One possibility is to add more training data to improve the efficacy of the models, but it will be a little while before we do so.

muchang commented 3 years ago

Thanks, John. Nice to know. It may be because there are no examples of "wealthy" in such a context in the training data, as you said. Do you mind telling us which corpora Stanza was trained on? We could help check whether the other corpora have the same issue and clean them up.

AngledLuffa commented 3 years ago

EWT, GUM, PUD, and Pronouns. I don't see any other examples of "wealthy" in any of those.

muchang commented 3 years ago

I see, thank you!

AngledLuffa commented 2 years ago

The current models for 1.4.0 tag wealthy, healthy, sick, and poor correctly.
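
A small regression check along these lines could confirm the result for each adjective. This is a sketch, not an official Stanza test; `tag_mismatches` is a hypothetical helper, and it assumes only that the pipeline returns a document whose words expose `.text` and `.upos`, as Stanza documents do.

```python
def tag_mismatches(nlp, sentence, expected):
    """Compare predicted UPOS tags against `expected` ({word text: UPOS}).

    `nlp` is any callable returning a Stanza-style document, i.e. an object
    with `.sentences`, each holding `.words` that expose `.text` and `.upos`.
    Returns a list of (text, predicted_upos, expected_upos) for each mismatch.
    """
    doc = nlp(sentence)
    mismatches = []
    for sent in doc.sentences:
        for word in sent.words:
            want = expected.get(word.text)
            if want is not None and word.upos != want:
                mismatches.append((word.text, word.upos, want))
    return mismatches


# Intended use with a real pipeline (not run here; requires downloaded models):
#   import stanza
#   nlp = stanza.Pipeline('en', processors='tokenize,pos')
#   for adj in ['wealthy', 'healthy', 'sick', 'poor']:
#       text = f'He also designed furniture and houses for the {adj}.'
#       print(adj, tag_mismatches(nlp, text, {adj: 'ADJ'}))
```

An empty list for every adjective means the tagger agrees with the UD guideline for that sentence.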

AngledLuffa commented 2 years ago

... and I think I can explain why it tags wealthy correctly now. We added GUMReddit to the list of training inputs, bringing the total number of instances of wealthy in our training data to 7, and by default the tagger fine-tunes words that have 7 or more instances in the training data.
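
The threshold effect can be illustrated with a simple count. The sketch below is not Stanza's training code; it only mirrors the rule as stated in this comment (7 or more instances in the training data). The names `forms_meeting_threshold` and `min_count` are hypothetical, lowercasing the forms is an assumption, and the token lists are toy data rather than real corpus counts.

```python
from collections import Counter


def forms_meeting_threshold(tokens, min_count=7):
    """Return the lowercased forms that occur at least `min_count` times."""
    counts = Counter(tok.lower() for tok in tokens)
    return {form for form, n in counts.items() if n >= min_count}


# Toy data: 6 occurrences fall below the cutoff, 7 reach it.
before = ["wealthy"] * 6 + ["the"] * 20
after = before + ["wealthy"]

print("wealthy" in forms_meeting_threshold(before))  # False
print("wealthy" in forms_meeting_threshold(after))   # True
```

A single extra instance is enough to move a word across the cutoff, which is consistent with the behaviour changing after one more corpus was added.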

muchang commented 2 years ago

Thanks for your update. It's interesting!