The primary problem is there are no examples of "wealthy" in such a context in the training data. However, there is an example of healthy/sick which is incorrectly tagged. I'll file an issue on the UD github.
# sent_id = newsgroup-groups.google.com_herpesnation_c74170a0fcfdc880_ENG_20051125_075200-0012
# text = When the healthy treat the sick with scorn and intolerance it brings us all down.
1 When when SCONJ WRB PronType=Int 4 mark 4:mark _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 healthy healthy ADJ JJ Degree=Pos 4 nsubj 4:nsubj _
4 treat treat VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 12 advcl 12:advcl:when _
5 the the DET DT Definite=Def|PronType=Art 6 det 6:det _
6 sick sick NOUN NN Number=Sing 4 obj 4:obj _
7 with with ADP IN _ 8 case 8:case _
8 scorn scorn NOUN NN Number=Sing 4 obl 4:obl:with _
9 and and CCONJ CC _ 10 cc 10:cc _
10 intolerance intolerance NOUN NN Number=Sing 8 conj 4:obl:with|8:conj:and _
11 it it PRON PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs 12 nsubj 12:nsubj _
12 brings bring VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
13 us we PRON PRP Case=Acc|Number=Plur|Person=1|PronType=Prs 12 obj 12:obj _
14 all all DET DT _ 13 det 13:det _
15 down down ADV RB _ 12 advmod 12:advmod SpaceAfter=No
16 . . PUNCT . _ 12 punct 12:punct _
Thanks, John.
BTW, although healthy/sick is incorrectly labelled, the POS tagging works correctly if I change "wealthy" to "sick" for this sentence:
$ python3 test.py
2021-06-25 14:30:34 INFO: Loading these models for language: en (English):
========================
| Processor | Package |
------------------------
| tokenize | combined |
| pos | combined |
========================
2021-06-25 14:30:34 INFO: Use device: cpu
2021-06-25 14:30:34 INFO: Loading: tokenize
2021-06-25 14:30:34 INFO: Loading: pos
2021-06-25 14:30:34 INFO: Done loading processors!
He None PRON
also None ADV
designed None VERB
furniture None NOUN
and None CCONJ
houses None NOUN
for None ADP
the None DET
sick None ADJ
. None PUNCT
$ cat test.py
import stanza

# Only tokenize and pos are loaded, so word.lemma prints as None.
nlp = stanza.Pipeline('en', processors='tokenize,pos')
doc = nlp('He also designed furniture and houses for the sick.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)
Which corpora was Stanza trained on besides UD_EWT? Perhaps healthy/sick is correctly labelled in the other corpora.
yeah, good call. here's the problem, in GUM:
# sent_id = GUM_voyage_merida-18
# s_type = decl
# text = The wealthy constructed the grand Pasejo Montejo avenue north of the old town, inspired by the Champs-Élysées in Paris.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det Discourse=joint:25->24|Entity=(person-50
2 wealthy wealthy NOUN NNS Number=Plur 3 nsubj 3:nsubj Entity=person-50)
3 constructed construct VERB VBD Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root _
4 the the DET DT Definite=Def|PronType=Art 8 det 8:det Entity=(place-51-Paseo_de_Montejo
5 grand grand ADJ JJ Degree=Pos 8 amod 8:amod _
6 Pasejo Pasejo PROPN NNP Number=Sing 8 compound 8:compound _
7 Montejo Montejo PROPN NNP Number=Sing 6 flat 6:flat Entity=(person-32-Francisco_de_Montejo_the_Younger)
8 avenue avenue NOUN NN Number=Sing 3 obj 3:obj Entity=place-51-Paseo_de_Montejo)
9 north north ADV RB Degree=Pos 3 advmod 3:advmod _
10 of of ADP IN _ 13 case 13:case _
11 the the DET DT Definite=Def|PronType=Art 13 det 13:det Bridge=place-1-Mérida%2C_Yucatán<place-52-Mérida%2C_Yucatán|Entity=(place-52-Mérida%2C_Yucatán
12 old old ADJ JJ Degree=Pos 13 amod 13:amod _
13 town town NOUN NN Number=Sing 9 nmod 9:nmod:of Entity=place-52-Mérida%2C_Yucatán)|SpaceAfter=No
14 , , PUNCT , _ 15 punct 15:punct _
15 inspired inspire VERB VBN Tense=Past|VerbForm=Part 13 acl 13:acl Discourse=elaboration:26->25
16 by by ADP IN _ 18 case 18:case _
17 the the DET DT Definite=Def|PronType=Art 18 det 18:det Entity=(place-53-Champs-Élysées
18 Champs-Élysées Champs-Élysées PROPN NNP Number=Sing 15 obl 15:obl:by _
19 in in ADP IN _ 20 case 20:case _
20 Paris Paris PROPN NNP Number=Sing 15 obl 15:obl:in Entity=(place-54-Paris)place-53-Champs-Élysées)|SpaceAfter=No
21 . . PUNCT . _ 3 punct 3:punct _
Cool, John!
Unfortunately, when I retrained the model with the updated data, it still didn't get the correct answer. One possibility is to add more training data to improve the efficacy of the models, but it will be a little while before we do so.
Thanks, John. Nice to know. It may be due to the fact that there are no examples of "wealthy" in such a context in the training data as you said. Do you mind telling us the list of the corpora that Stanza trained on? We could help check out whether other corpora have the same issues and clean them up.
EWT, GUM, PUD, and Pronouns. I don't see any other examples of "wealthy" in any of those.
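For anyone who wants to repeat that check on other treebanks, here is a minimal sketch that tallies the UPOS tags a given word form receives in CoNLL-U text. It assumes tab-separated 10-column CoNLL-U as distributed by UD. The function name `tally_upos` and the abbreviated sample (features and misc columns replaced by `_`) are made up for illustration and are not part of Stanza or the UD tools.

```python
from collections import Counter


def tally_upos(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip blank sentence separators and comment lines such as "# sent_id = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1  # column 4 is UPOS
    return counts


# Abbreviated from the GUM sentence quoted above (hypothetical sample data).
SAMPLE = "\n".join([
    "# sent_id = GUM_voyage_merida-18",
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\twealthy\twealthy\tNOUN\tNNS\t_\t3\tnsubj\t_\t_",
    "3\tconstructed\tconstruct\tVERB\tVBD\t_\t0\troot\t_\t_",
])

print(tally_upos(SAMPLE, "wealthy"))
```

Running it over each training file and comparing the tallies for "wealthy", "healthy", "sick", and "poor" would show where the treebanks disagree.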
I see, thank you!
The current models for 1.4.0 tag "wealthy", "healthy", "sick", and "poor" correctly.
... and I think I can explain why it tags "wealthy" correctly now. We added GUMReddit to the list of training inputs, bringing the total number of instances of "wealthy" in our training data to 7, and by default it fine-tunes words when they have 7 or more instances in the training data.
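To make the cutoff concrete, here is a rough illustration of a frequency threshold of this kind. This is not Stanza's code: the function name `words_meeting_cutoff`, the lowercasing, and the toy token list are assumptions. Only the value 7 comes from the comment above.

```python
from collections import Counter

MIN_COUNT = 7  # cutoff quoted in the thread; how Stanza applies it is not shown here


def words_meeting_cutoff(tokens, min_count=MIN_COUNT):
    """Return the set of word forms seen at least `min_count` times."""
    # Lowercasing is an assumption made for this sketch.
    counts = Counter(token.lower() for token in tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Toy data: "wealthy" reaches the cutoff, "healthy" falls one instance short.
tokens = ["wealthy"] * 7 + ["healthy"] * 6 + ["the"] * 20
print(sorted(words_meeting_cutoff(tokens)))
```

With a hard cutoff like this, a single extra corpus can move a word from below the threshold to above it, which matches what happened with "wealthy" once GUMReddit was added.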
Thanks for your update. It's interesting!
Describe the bug
For this sentence, "wealthy" is an adjective that exceptionally heads a nominal phrase, which should be tagged as ADJ according to Universal Dependencies.
To Reproduce
Expected behavior
"wealthy" should be tagged as ADJ.
Environment (please complete the following information):