stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

French regression between Stanza 1.8.1 and 1.8.2 #1404

Open blegaut opened 2 months ago

blegaut commented 2 months ago

Describe the bug

Take the following sentence: Assurez-vous d'être à l'heure !

The word "vous" gets the wrong dependency relation with Stanza 1.8.2, but the correct one with Stanza 1.8.1.

Stanza 1.8.1:

          {
            "id": 2,
            "text": "-vous",
            "lemma": "vous",
            "upos": "PRON",
            "feats": "Emph=No|Number=Plur|Person=2|PronType=Prs",
            "head": 1,
            "deprel": "obj",
            "start_char": 7,
            "end_char": 12,
            "ner": "O",
            "multi_ner": [
              "O"
            ]
          },

Stanza 1.8.2:

          {
            "id": 2,
            "text": "-vous",
            "lemma": "vous",
            "upos": "PRON",
            "feats": "Emph=No|Number=Plur|Person=2|PronType=Prs",
            "head": 1,
            "deprel": "nsubj",
            "start_char": 7,
            "end_char": 12,
            "ner": "O",
            "multi_ner": [
              "O"
            ]
          },

To Reproduce

Steps to reproduce the behavior: see above

Expected behavior

I would expect the same analysis regardless of the version.
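A minimal reproduction sketch, assuming Stanza and its French models are installed. The helper names `deprel_of` and `parse_to_words` are introduced here for illustration, and the processor list is an assumption about the reporter's pipeline. Running it once in an environment with Stanza 1.8.1 and once with 1.8.2 should show the two labels reported above.

```python
def deprel_of(words, surface):
    """Return the deprel of every word dict whose text equals `surface`."""
    return [w.get("deprel") for w in words if w.get("text") == surface]


def parse_to_words(text, lang="fr"):
    """Parse `text` with Stanza; return one list of word dicts per sentence.

    Stanza is imported lazily so `deprel_of` can be used on saved JSON
    output without having the library installed.
    """
    import stanza

    nlp = stanza.Pipeline(lang, processors="tokenize,mwt,pos,lemma,depparse")
    doc = nlp(text)
    # sentence.words holds syntactic words only, so multi-word token ranges
    # such as "aux" or "des" do not appear here.
    return [[w.to_dict() for w in sent.words] for sent in doc.sentences]


# Example use (requires Stanza and downloaded French models):
#   import stanza
#   for words in parse_to_words("Assurez-vous d'être à l'heure !"):
#       print(stanza.__version__, deprel_of(words, "-vous"))
```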

AngledLuffa commented 2 months ago

So, I'm not surprised there are FR changes over time. We created a "combined" FR model to be the default out of four mostly compatible treebanks.

There is exactly one line with Assurez-vous in it (zero with assurez-vous) and the dependency is actually neither obj nor nsubj:

# text = Assurez-vous de boire suffisamment (au moins un à deux verres) avant et après le traitement par Aclasta, selon les instructions de votre médecin : ceci afin d'éviter une déshydratation.
1       Assurez assurer VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       SpaceAfter=No
2       -vous   vous    PRON    _       Number=Plur|Person=2|PronType=Prs|Reflex=Yes    1       expl:pv _       _

Does this dependency look reasonable to you? At any rate, I can rebuild the FR models with the latest versions of the datasets, and perhaps it will improve performance somewhat.
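The kind of treebank search described above can be sketched in plain Python. This is an illustration, not the script actually used: `read_conllu` and `find_pair` are hypothetical names, and the code assumes a genuine tab-separated CoNLL-U file (the excerpts pasted in this thread show the columns separated by spaces).

```python
def read_conllu(text):
    """Yield (sentence_text, rows); each row is the list of CoNLL-U columns."""
    sent_text, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if rows:
                yield sent_text, rows
            sent_text, rows = None, []
        elif line.startswith("# text = "):
            sent_text = line[len("# text = "):]
        elif not line.startswith("#"):
            rows.append(line.split("\t"))
    if rows:
        yield sent_text, rows


def find_pair(text, first, second):
    """Yield (sentence_text, deprel of `second`) wherever the word `first`
    is immediately followed by the word `second` (case-sensitive)."""
    for sent_text, rows in read_conllu(text):
        # Keep plain word rows; skip multi-word ranges (4-5) and empty nodes (1.1).
        words = [r for r in rows if r[0].isdigit()]
        for prev, cur in zip(words, words[1:]):
            if prev[1] == first and cur[1] == second:
                yield sent_text, cur[7]
```

Calling `find_pair(treebank_text, "Assurez", "-vous")` and then again with `"assurez"` would reproduce the two counts mentioned in the comment, given the same treebank files.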

blegaut commented 2 months ago

Thanks for your quick reply.

Yes, expl:pv is definitely the best option here. I hope it works once you rebuild the FR models. Please let me know how and when I can test it.

Thanks,

Bernard

AngledLuffa commented 2 months ago

Mmm, unfortunately, the models continue to call it nsubj after rebuilding with the latest versions of the git data. That's also true for the version using a transformer. One option here is to throw together a couple sentences which cover the dependency and add that to the training data. I don't know any French, so I don't think I should be the one to do it, but if you have suggested dependencies for a couple sentences, that would likely be enough.

(We could also start with parses for a couple sentences with that pair of words and correct the errors that show up.)

blegaut commented 2 months ago

Hello, I am happy to contribute by providing a couple of corrected sentences. What would be the expected format and the proper repository?

I also noticed some other regressions after the rebuild with the latest versions of the git data. Is there any way to access the previous versions?

Thanks

AngledLuffa commented 2 months ago

> Is there any way to access the previous versions?

Well, yes, that's technically possible. They should be in the HuggingFace history for the FR models. That said, the idea behind making the newer models is that other things will work better with the updated data.

https://huggingface.co/stanfordnlp/stanza-fr

If you can come up with some example regression sentences, perhaps the best format would be plain text sentences (cut down so they demonstrate the error but aren't 50 words long). I'll run them through our best models, and you can let me know where you spot errors.
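For anyone who does want an older model file, one possible route is the `huggingface_hub` client, which can list a repository's commits and download a file at a given revision. The sketch below rests on several assumptions: the helper names are hypothetical, the file name inside the repository is not given in this thread and has to be looked up on the model page, and `EXAMPLE_CUTOFF` is an arbitrary placeholder date rather than an actual release date.

```python
from datetime import datetime, timezone

# Placeholder cutoff for illustration only; not the date of any real release.
EXAMPLE_CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)


def latest_commit_before(commits, cutoff):
    """Return the id of the newest commit created strictly before `cutoff`.

    `commits` is any iterable of objects with `commit_id` and `created_at`
    attributes. Returns None when no commit is old enough.
    """
    older = [c for c in commits if c.created_at < cutoff]
    if not older:
        return None
    return max(older, key=lambda c: c.created_at).commit_id


def download_old_model(filename, cutoff, repo_id="stanfordnlp/stanza-fr"):
    """Download `filename` as it existed before `cutoff`; return the local path."""
    # Imported lazily: only this function needs network access and the client.
    from huggingface_hub import hf_hub_download, list_repo_commits

    revision = latest_commit_before(list_repo_commits(repo_id), cutoff)
    if revision is None:
        raise ValueError("no commit found before the cutoff")
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
```

The returned path could then be handed to a pipeline, for instance through Stanza's `depparse_model_path` option, keeping in mind that an old model is not guaranteed to load in a newer library version.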

blegaut commented 1 month ago

Here are some example regression sentences:

Thanks,

Bernard

AngledLuffa commented 1 month ago

If I put some of these into the "accurate" models with a Transformer, the output already follows some of these recommendations. I can post some here:

# recommandons is the verb

# text = Nous vous recommandons vivement d'investir dans un système aux normes.
# sent_id = 0
1       Nous    nous    PRON    _       Emph=No|Number=Plur|Person=1|PronType=Prs       3       nsubj   _       start_char=0|end_char=4|ner=O
2       vous    vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       3       iobj    _       start_char=5|end_char=9|ner=O
3       recommandons    recommander     VERB    _       Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin   0       root    _       start_char=10|end_char=22|ner=O
4       vivement        vivement        ADV     _       _       3       advmod  _       start_char=23|end_char=31|ner=O
5       d'      de      ADP     _       _       6       mark    _       start_char=32|end_char=34|ner=O|SpaceAfter=No
6       investir        investir        VERB    _       VerbForm=Inf    3       xcomp   _       start_char=34|end_char=42|ner=O
7       dans    dans    ADP     _       _       9       case    _       start_char=43|end_char=47|ner=O
8       un      un      DET     _       Definite=Ind|Gender=Masc|Number=Sing|PronType=Art       9       det     _       start_char=48|end_char=50|ner=O
9       système système NOUN    _       Gender=Masc|Number=Sing 6       obl:arg _       start_char=51|end_char=58|ner=O
10-11   aux     _       _       _       _       _       _       _       start_char=59|end_char=62|ner=O
10      à       à       ADP     _       _       12      case    _       _
11      les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   12      det     _       _
12      normes  norme   NOUN    _       Gender=Fem|Number=Plur  9       nmod    _       start_char=63|end_char=69|ner=O|SpaceAfter=No
13      .       .       PUNCT   _       _       3       punct   _       start_char=69|end_char=70|ner=O|SpaceAfter=No

# Élaborez is the verb

# text = Élaborez un plan de gestion de crise.
# sent_id = 0
1       Élaborez        élaborer        VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=0|end_char=8|ner=O
2       un      un      DET     _       Definite=Ind|Gender=Masc|Number=Sing|PronType=Art       3       det     _       start_char=9|end_char=11|ner=O
3       plan    plan    NOUN    _       Gender=Masc|Number=Sing 1       obj     _       start_char=12|end_char=16|ner=O
4       de      de      ADP     _       _       5       case    _       start_char=17|end_char=19|ner=O
5       gestion gestion NOUN    _       Gender=Fem|Number=Sing  3       nmod    _       start_char=20|end_char=27|ner=O
6       de      de      ADP     _       _       7       case    _       start_char=28|end_char=30|ner=O
7       crise   crise   NOUN    _       Gender=Fem|Number=Sing  5       nmod    _       start_char=31|end_char=36|ner=O|SpaceAfter=No
8       .       .       PUNCT   _       _       1       punct   _       start_char=36|end_char=37|ner=O|SpaceAfter=No

# would you check this?

# text = Il semble que vous ne soyez pas informé.
# sent_id = 0
1       Il      lui     PRON    _       Emph=No|Gender=Masc|Number=Sing|Person=3|PronType=Prs   2       expl:subj       _       start_char=0|end_char=2|ner=O
2       semble  sembler VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       start_char=3|end_char=9|ner=O
3       que     que     SCONJ   _       _       8       mark    _       start_char=10|end_char=13|ner=O
4       vous    vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       8       nsubj:pass      _       start_char=14|end_char=18|ner=O
5       ne      ne      ADV     _       Polarity=Neg    8       advmod  _       start_char=19|end_char=21|ner=O
6       soyez   être    AUX     _       Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   8       aux:pass        _       start_char=22|end_char=27|ner=O
7       pas     pas     ADV     _       Polarity=Neg    8       advmod  _       start_char=28|end_char=31|ner=O
8       informé informer        VERB    _       Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass     2       csubj   _       start_char=32|end_char=39|ner=O|SpaceAfter=No
9       .       .       PUNCT   _       _       2       punct   _       start_char=39|end_char=40|ner=O|SpaceAfter=No

# Mettez is the verb

# text = Mettez en place des politiques de recouvrement plus strictes!
# sent_id = 0
1       Mettez  mettre  VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=0|end_char=6|ner=S-LOC
2       en      en      ADP     _       _       3       case    _       start_char=7|end_char=9|ner=O
3       place   place   NOUN    _       Gender=Fem|Number=Sing  1       obl:mod _       start_char=10|end_char=15|ner=O
4-5     des     _       _       _       _       _       _       _       start_char=16|end_char=19|ner=O
4       de      de      ADP     _       _       6       case    _       _
5       les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   6       det     _       _
6       politiques      politique       NOUN    _       Gender=Fem|Number=Plur  1       obl:arg _       start_char=20|end_char=30|ner=O
7       de      de      ADP     _       _       8       case    _       start_char=31|end_char=33|ner=O
8       recouvrement    recouvrement    NOUN    _       Gender=Masc|Number=Sing 6       nmod    _       start_char=34|end_char=46|ner=O
9       plus    plus    ADV     _       _       10      advmod  _       start_char=47|end_char=51|ner=O
10      strictes        strict  ADJ     _       Gender=Fem|Number=Plur  6       amod    _       start_char=52|end_char=60|ner=O|SpaceAfter=No
11      !       !       PUNCT   _       _       1       punct   _       start_char=60|end_char=61|ner=O|SpaceAfter=No

# experts is the subject

# text = Nos experts peuvent vous conseiller.
# sent_id = 0
1       Nos     son     DET     _       Number=Plur|Number[psor]=Plur|Person[psor]=1|Poss=Yes|PronType=Prs      2       det     _       start_char=0|end_char=3|ner=S-LOC
2       experts expert  NOUN    _       Gender=Masc|Number=Plur 3       nsubj   _       start_char=4|end_char=11|ner=O
3       peuvent pouvoir VERB    _       Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       start_char=12|end_char=19|ner=O
4       vous    vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       5       obj     _       start_char=20|end_char=24|ner=O
5       conseiller      conseiller      VERB    _       VerbForm=Inf    3       xcomp   _       start_char=25|end_char=35|ner=O|SpaceAfter=No
6       .       .       PUNCT   _       _       3       punct   _       start_char=35|end_char=36|ner=O|SpaceAfter=No

blegaut commented 1 month ago

Everything looks good! Thank you.

AngledLuffa commented 1 month ago

This is what it came up with for ...

# text = Assurez-vous d'être à l'heure !
# sent_id = 0
1       Assurez assurer VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=0|end_char=7|ner=O|SpaceAfter=No
2       -vous   vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       1       nsubj   _       start_char=7|end_char=12|ner=O
3       d'      de      ADP     _       _       4       mark    _       start_char=13|end_char=15|ner=O|SpaceAfter=No
4       être    être    AUX     _       VerbForm=Inf    1       ccomp   _       start_char=15|end_char=19|ner=O
5       à       à       ADP     _       _       7       case    _       start_char=20|end_char=21|ner=O
6       l'      le      DET     _       Definite=Def|Number=Sing|PronType=Art   7       det     _       start_char=22|end_char=24|ner=O|SpaceAfter=No
7       heure   heure   NOUN    _       Gender=Fem|Number=Sing  4       obl:arg _       start_char=24|end_char=29|ner=O
8       !       !       PUNCT   _       _       1       punct   _       start_char=30|end_char=31|ner=O|SpaceAfter=No

But you were saying the expl:pv dependency is better?

Can you suggest one or two other sentences with Assurez-vous or assurez-vous in them?

blegaut commented 1 month ago

yes, sure. Here are a few sentences:

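AngledLuffa commented 1 month ago

# text = Puisque vous êtes équipé d'un logiciel de facturation, assurez-vous d'utiliser le système de relance afin de résorber les retards de paiement que vous déplorez.
# sent_id = 0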
1       Puisque puisque SCONJ   _       _       4       mark    _       start_char=0|end_char=7|ner=O
2       vous    vous    PRON    _       Number=Plur|Person=2|PronType=Prs       4       nsubj:pass      _       start_char=8|end_char=12|ner=O
3       êtes    être    AUX     _       Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   4       aux:pass        _       start_char=13|end_char=17|ner=O
4       équipé  équiper VERB    _       Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass     11      advcl   _       start_char=18|end_char=24|ner=O
5       d'      de      ADP     _       _       7       case    _       start_char=25|end_char=27|ner=O|SpaceAfter=No
6       un      un      DET     _       Definite=Ind|Gender=Masc|Number=Sing|PronType=Art       7       det     _       start_char=27|end_char=29|ner=O
7       logiciel        logiciel        NOUN    _       Gender=Masc|Number=Sing 4       obl:arg _       start_char=30|end_char=38|ner=O
8       de      de      ADP     _       _       9       case    _       start_char=39|end_char=41|ner=O
9       facturation     facturation     NOUN    _       Gender=Fem|Number=Sing  7       nmod    _       start_char=42|end_char=53|ner=O|SpaceAfter=No
10      ,       ,       PUNCT   _       _       4       punct   _       start_char=53|end_char=54|ner=O
11      assurez assurer VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=55|end_char=62|ner=O|SpaceAfter=No
12      -vous   vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       11      nsubj   _       start_char=62|end_char=67|ner=O
13      d'      de      ADP     _       _       14      mark    _       start_char=68|end_char=70|ner=O|SpaceAfter=No
14      utiliser        utiliser        VERB    _       VerbForm=Inf    11      ccomp   _       start_char=70|end_char=78|ner=O
15      le      le      DET     _       Definite=Def|Gender=Masc|Number=Sing|PronType=Art       16      det     _       start_char=79|end_char=81|ner=O
16      système système NOUN    _       Gender=Masc|Number=Sing 14      obj     _       start_char=82|end_char=89|ner=O
17      de      de      ADP     _       _       18      case    _       start_char=90|end_char=92|ner=O
18      relance relance NOUN    _       Gender=Fem|Number=Sing  16      nmod    _       start_char=93|end_char=100|ner=O
19      afin    afin    ADV     _       _       14      advmod  _       start_char=101|end_char=105|ner=O
20      de      de      ADP     _       _       21      mark    _       start_char=106|end_char=108|ner=O
21      résorber        résorber        VERB    _       VerbForm=Inf    19      ccomp   _       start_char=109|end_char=117|ner=O
22      les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   23      det     _       start_char=118|end_char=121|ner=O
23      retards retard  NOUN    _       Gender=Masc|Number=Plur 21      obj     _       start_char=122|end_char=129|ner=O
24      de      de      ADP     _       _       25      case    _       start_char=130|end_char=132|ner=O
25      paiement        paiement        NOUN    _       Gender=Masc|Number=Sing 23      nmod    _       start_char=133|end_char=141|ner=O
26      que     que     PRON    _       PronType=Rel    28      obj     _       start_char=142|end_char=145|ner=O
27      vous    vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       28      nsubj   _       start_char=146|end_char=150|ner=O
28      déplorez        déplorer        VERB    _       Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   23      acl:relcl       _       start_char=151|end_char=159|ner=O|SpaceAfter=No
29      .       .       PUNCT   _       _       11      punct   _       start_char=159|end_char=160|ner=O|SpaceAfter=No

# text = Assurez-vous de bien suivre la réglementation qui encadre votre secteur d'activité
# sent_id = 0
1       Assurez assurer VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=0|end_char=7|ner=O|SpaceAfter=No
2       -vous   vous    PRON    _       Number=Plur|Person=2|PronType=Prs       1       nsubj   _       start_char=7|end_char=12|ner=O
3       de      de      ADP     _       _       5       mark    _       start_char=13|end_char=15|ner=O
4       bien    bien    ADV     _       _       5       advmod  _       start_char=16|end_char=20|ner=O
5       suivre  suivre  VERB    _       VerbForm=Inf    1       xcomp   _       start_char=21|end_char=27|ner=O
6       la      le      DET     _       Definite=Def|Gender=Fem|Number=Sing|PronType=Art        7       det     _       start_char=28|end_char=30|ner=O
7       réglementation  réglementation  NOUN    _       Gender=Fem|Number=Sing  5       obj     _       start_char=31|end_char=45|ner=O
8       qui     qui     PRON    _       PronType=Rel    9       nsubj   _       start_char=46|end_char=49|ner=O
9       encadre encadrer        VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   7       acl:relcl       _       start_char=50|end_char=57|ner=O
10      votre   son     DET     _       Number=Sing|Poss=Yes    11      det     _       start_char=58|end_char=63|ner=O
11      secteur secteur NOUN    _       Gender=Masc|Number=Sing 9       obj     _       start_char=64|end_char=71|ner=O
12      d'      de      ADP     _       _       13      case    _       start_char=72|end_char=74|ner=O|SpaceAfter=No
13      activité        activité        NOUN    _       Gender=Fem|Number=Sing  11      nmod    _       start_char=74|end_char=82|ner=O|SpaceAfter=No

# text = Assurez-vous de couvrir les risques potentiels, y compris les incendies, les catastrophes naturelles et le vol.
# sent_id = 0
1       Assurez assurer VERB    _       Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin   0       root    _       start_char=0|end_char=7|ner=O|SpaceAfter=No
2       -vous   vous    PRON    _       Emph=No|Number=Plur|Person=2|PronType=Prs       1       nsubj   _       start_char=7|end_char=12|ner=O
3       de      de      ADP     _       _       4       mark    _       start_char=13|end_char=15|ner=O
4       couvrir couvrir VERB    _       VerbForm=Inf    1       ccomp   _       start_char=16|end_char=23|ner=O
5       les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   6       det     _       start_char=24|end_char=27|ner=O
6       risques risque  NOUN    _       Gender=Masc|Number=Plur 4       obj     _       start_char=28|end_char=35|ner=O
7       potentiels      potentiel       ADJ     _       Gender=Masc|Number=Plur 6       amod    _       start_char=36|end_char=46|ner=O|SpaceAfter=No
8       ,       ,       PUNCT   _       _       12      punct   _       start_char=46|end_char=47|ner=O
9       y       y       PRON    _       Emph=No|ExtPos=ADP|Person=3|PronType=Prs        12      case    _       start_char=48|end_char=49|ner=O
10      compris comprendre      VERB    _       Gender=Masc|Tense=Past|VerbForm=Part|Voice=Pass 9       fixed   _       start_char=50|end_char=57|ner=O
11      les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   12      det     _       start_char=58|end_char=61|ner=O
12      incendies       incendie        NOUN    _       Gender=Masc|Number=Plur 6       nmod    _       start_char=62|end_char=71|ner=O|SpaceAfter=No
13      ,       ,       PUNCT   _       _       15      punct   _       start_char=71|end_char=72|ner=O
14      les     le      DET     _       Definite=Def|Number=Plur|PronType=Art   15      det     _       start_char=73|end_char=76|ner=O
15      catastrophes    catastrophe     NOUN    _       Gender=Fem|Number=Plur  12      conj    _       start_char=77|end_char=89|ner=O
16      naturelles      naturel ADJ     _       Gender=Fem|Number=Plur  15      amod    _       start_char=90|end_char=100|ner=O
17      et      et      CCONJ   _       _       19      cc      _       start_char=101|end_char=103|ner=O
18      le      le      DET     _       Definite=Def|Gender=Masc|Number=Sing|PronType=Art       19      det     _       start_char=104|end_char=106|ner=O
19      vol     vol     NOUN    _       Gender=Masc|Number=Sing 12      conj    _       start_char=107|end_char=110|ner=O|SpaceAfter=No
20      .       .       PUNCT   _       _       1       punct   _       start_char=110|end_char=111|ner=O|SpaceAfter=No

Each of the -vous is an nsubj instead of expl:pv. Also, any thoughts on the previous one aside from the nsubj -> expl:pv change?

blegaut commented 1 month ago

I would say that the change from nsubj to expl:pv is required for all occurrences of -vous. I can't see any other changes needed in these sentences. Thanks
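Applying that correction to parser output before hand-checking it could look like the sketch below. It is an illustration under stated assumptions, not the tool used for the training file: the function name is hypothetical, the input must be tab-separated CoNLL-U with sentences separated by one blank line, and the change is deliberately limited to `-vous` attached to a form of "assurer", since an inverted subject as in "Voulez-vous" should stay nsubj.

```python
def relabel_assurez_vous(conllu, old="nsubj", new="expl:pv", head_lemma="assurer"):
    """Change DEPREL from `old` to `new` on `-vous` rows whose head has
    lemma `head_lemma`. Comments and all other rows are left untouched."""
    blocks = []
    for block in conllu.split("\n\n"):
        rows = [line.split("\t") for line in block.split("\n")]
        # Word rows have exactly 10 columns: ID, FORM, LEMMA, ..., HEAD, DEPREL, ...
        lemma_by_id = {r[0]: r[2] for r in rows if len(r) == 10}
        for r in rows:
            if (len(r) == 10 and r[1] == "-vous" and r[7] == old
                    and lemma_by_id.get(r[6]) == head_lemma):
                r[7] = new
        blocks.append("\n".join("\t".join(r) for r in rows))
    return "\n\n".join(blocks)
```

The treebank example quoted earlier also carries `Reflex=Yes` on the pronoun; this sketch leaves the feature column alone, so that would remain a manual edit.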

AngledLuffa commented 1 month ago

Alright, I put a candidate fake training file here:

https://github.com/stanfordnlp/handparsed-treebank/commit/0fac6a83754baf52f93eff66a5447340d06f1d3d

Any thoughts on these?

Also sent them to a former colleague who's worked on French datasets before.

AngledLuffa commented 1 month ago

If you find any other regressions, please don't hesitate to send them our way. I can rerun the depparse training with these sentences and see if it helps.
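To keep track of such regressions across model rebuilds, a small check along these lines may be useful. It is only a sketch: `check_regressions` and `stanza_parser` are hypothetical names, each test case is assumed to be a single sentence, and the expected labels are whatever the annotators agree on.

```python
def check_regressions(cases, parse):
    """Return (sentence, word_id, text, expected, observed) for every mismatch.

    `cases` maps a sentence to {word id: expected deprel}.
    `parse` is any callable returning [(word id, text, deprel), ...].
    """
    failures = []
    for sentence, expected in cases.items():
        observed = {wid: (text, deprel) for wid, text, deprel in parse(sentence)}
        for wid, want in expected.items():
            text, got = observed.get(wid, (None, None))
            if got != want:
                failures.append((sentence, wid, text, want, got))
    return failures


def stanza_parser(lang="fr", package="default"):
    """Build a `parse` callable backed by Stanza (imported lazily)."""
    import stanza

    nlp = stanza.Pipeline(lang, package=package)

    def parse(sentence):
        doc = nlp(sentence)
        return [(w.id, w.text, w.deprel) for s in doc.sentences for w in s.words]

    return parse


# Example use (requires Stanza and downloaded French models):
#   cases = {"Assurez-vous d'être à l'heure !": {2: "expl:pv"}}
#   for failure in check_regressions(cases, stanza_parser()):
#       print(failure)
```

Passing the parser in as a callable keeps the comparison independent of any particular model, so the same cases can be run against several packages or versions.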

AngledLuffa commented 1 month ago

Well... just training on those sentences isn't helping either model get the expl:pv relation in Assurez-vous. Maybe a couple more sentences would help, maybe not (there is a cutoff of 7 where it starts finetuning words, so it may indeed help to add a couple more). At any rate, I suggest using the default_accurate package, since you seemed pretty satisfied with the other parses above.
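Whether a few extra sentences would push a word past that cutoff can be estimated by counting word forms in the training file. The sketch below assumes the cutoff means "a form must occur at least 7 times, counted case-insensitively"; the comment only gives the number 7, so that reading, like the function names, is an assumption. Input is tab-separated CoNLL-U.

```python
from collections import Counter


def word_counts(conllu):
    """Count lowercased word forms over all plain word rows."""
    counts = Counter()
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multi-word ranges and empty nodes.
        if len(cols) == 10 and cols[0].isdigit():
            counts[cols[1].lower()] += 1
    return counts


def below_cutoff(conllu, words, cutoff=7):
    """Return {word: count} for the given words that occur fewer than
    `cutoff` times in the training data."""
    counts = word_counts(conllu)
    return {w: counts[w.lower()] for w in words if counts[w.lower()] < cutoff}
```

For example, `below_cutoff(training_text, ["assurez", "-vous"])` would list which of the two forms are still too rare under this reading, along with their current counts.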

AngledLuffa commented 1 month ago

Alright, I realized I had mistrained the models with the new dependencies. The new models seem to get expl:pv for a couple of the examples I tried for assurez-vous. I posted those as the new defaults. I'll send those sentences to a former colleague to see if she has any suggestions on the dependencies, just to make sure