joliciel-informatique / talismane

NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser
https://github.com/urieli/talismane/wiki
GNU Affero General Public License v3.0

Parsing errors, version 5.1.1, French #31

Open abalvet opened 6 years ago

abalvet commented 6 years ago

"Les poules du couvent couvent." should return: 1 Les le DET det p 2 det 2 poules poule NC nc fp 5 suj 3 du de P+D P+D ms 2 dep 4 couvent couvent NC nc ms 3 obj 5 couvent couver V v PS3p 0 root 6 . . PONCT PONCT null 5 punct

This is what I get:

1   Les      les      DET    DET    n=p      2  det    2  det
2   poules   poule    NC     NC     n=p|g=f  0         0
3   du       de       P+D    P+D    n=s|g=m  2  dep    2  dep
4   couvent  couvent  NC     NC     n=s|g=m  3  prep   3  prep
5   couvent  couvent  NC     NC     n=s|g=m  4  mod    4  mod    <= VERY wrong here due to tagging error
6   .        .        PONCT  PONCT           5  ponct  5  ponct

This is not the only error/parsing issue. Talismane doesn't seem to distinguish indirect objects from PP modifiers.

Plus, it attaches the final punctuation to the last token, while Malt attaches the final punctuation to the last verbal root.

Talismane seems to systematically attach the NP to the last Prep, tagging its function as "prep" (prepositional object?), while Malt seems to systematically consider Preps as modifiers of the last v root, with the head Noun being tagged as "obj". Malt doesn't seem to distinguish between indirect objects and PP modifiers, either. Maybe this comes from the dependency analyses of the FTB? Sometimes, the form of the preposition seems to have some influence on the function of the head noun.

Talismane, as well as Malt, does not distinguish transitive verbs from intransitive ones: in "Maurice dort le matin", "le matin" should be tagged "mod", not "obj".

Here is a side-by-side comparison with Malt (MaltParser 1.9.2 + fremalt-1.7.mco) on a set of very simple sentences.

Columns: ID, TOKEN, LEMMA, TAG_RED, TAG_EXT, then ID_REL and REL from TALISMANE, then ID_REL and REL from MALT.
1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   accorde        accorder      V      V      0   root   0   root
3   sa             sa            DET    DET    4   det    4   det
4   guitare        guitare       NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   l'             le            DET    DET    7   det    7   det
7   oreille        oreille       NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   accorde        accorder      V      V      0   root   0   root
3   sa             sa            DET    DET    4   det    4   det
4   confiance      confiance     NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    4   dep
6   la             la            DET    DET    7   det    7   det
7   directrice     directrice    NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   accorde        accorder      V      V      0   root   0   root
3   sa             sa            DET    DET    4   det    4   det
4   confiance      confiance     NC     NC     2   obj    2   obj
5   au             à             P+D    P+D    4   dep    2   mod
6   directeur      directeur     NC     NC     5   prep   5   obj
7   .              .             PONCT  PONCT  6   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   achète         acheter       V      V      0   root   0   root
3   une            une           DET    DET    4   det    4   det
4   voiture        voiture       NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   sa             sa            DET    DET    7   det    7   det
7   femme          femme         NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   achète         acheter       V      V      0   root   0   root
3   une            une           DET    DET    4   det    4   det
4   voiture        voiture       NC     NC     2   obj    2   obj
5   pour           pour          P      P      2   mod    2   mod
6   sa             sa            DET    DET    7   det    7   det
7   femme          femme         NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   achète         acheter       V      V      0   root   0   root
3   une            une           DET    DET    4   det    4   det
4   voiture        voiture       NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   la             la            DET    DET    7   det    7   det
7   représentante  représentant  NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   achète         acheter       V      V      0   root   0   root
3   une            une           DET    DET    4   det    4   det
4   voiture        voiture       NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   la             la            DET    DET    7   det    7   det
7   sauvette       sauvette      NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   achète         acheter       V      V      0   root   0   root
3   une            une           DET    DET    4   det    4   det
4   voiture        voiture       NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   l'             le            DET    DET    7   det    7   det
7   étranger       étranger      NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   dort           dormir        V      V      0   root   0   root
3   le             le            DET    DET    4   det    4   det
4   matin          matin         NC     NC     2   obj    2   obj
5   jusqu'         jusque        P      P      2   mod    2   mod
6   à              à             P      P      5   prep   5   obj
7   10             10            ADJ    ADJ    8   mod    8   mod
8   heures         heure         NC     NC     6   prep   6   obj
9   .              .             PONCT  PONCT  8   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   dort           dormir        V      V      0   root   0   root
3   dans           dans          P      P      2   mod    2   mod
4   son            son           DET    DET    5   det    5   det
5   jardin         jardin        NC     NC     3   prep   3   obj
6   .              .             PONCT  PONCT  5   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   respire        respirer      V      V      0   root   0   root
3   la             la            DET    DET    4   det    4   det
4   santé          santé         NC     NC     2   obj    2   obj
5   .              .             PONCT  PONCT  4   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   respire        respirer      V      V      0   root   0   root
3   la             la            DET    DET    5   det    5   det
4   bonne          bon           ADJ    ADJ    5   mod    5   mod
5   odeur          odeur         NC     NC     2   obj    2   obj
6   du             de            P+D    P+D    5   dep    5   dep
7   gâteau         gâteau        NC     NC     6   prep   6   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   sent           sentir        V      V      0   root   0   root
3   la             la            DET    DET    4   det    4   det
4   bière          bière         NC     NC     2   obj    2   obj
5   .              .             PONCT  PONCT  4   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   donne          donner        V      V      0   root   0   root
3   un             un            DET    DET    4   det    4   det
4   livre          livre         NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   son            son           DET    DET    7   det    7   det
7   frère          frère         NC     NC     5   prep   5   obj
8   .              .             PONCT  PONCT  7   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   donne          donner        V      V      0   root   0   root
3   un             un            DET    DET    4   det    4   det
4   livre          livre         NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   lire           lire          VINF   VINF   5   prep   5   obj
7   à              à             P      P      6   mod    6   mod
8   son            son           DET    DET    9   det    9   det
9   frère          frère         NC     NC     7   prep   7   obj
10  .              .             PONCT  PONCT  9   ponct  2   ponct

1   Maurice        Maurice       NPP    NPP    2   suj    2   suj
2   donne          donner        V      V      0   root   0   root
3   un             un            DET    DET    4   det    4   det
4   livre          livre         NC     NC     2   obj    2   obj
5   à              à             P      P      2   mod    2   mod
6   son            son           DET    DET    7   det    7   det
7   frère          frère         NC     NC     5   prep   5   obj
8   à              à             P      P      2   mod    2   mod
9   la             la            DET    DET    10  det    10  det
10  fin            fin           NC     NC     8   prep   8   obj
11  de             de            P      P      2   mod    10  dep
12  la             la            DET    DET    13  det    13  det
13  journée        journée       NC     NC     11  prep   11  obj
14  .              .             PONCT  PONCT  13  ponct  2   ponct

1   À              à             P      P      9   mod    9   mod
2   la             la            DET    DET    3   det    3   det
3   fin            fin           NC     NC     1   prep   1   obj
4   de             de            P      P      3   dep    3   dep
5   la             la            DET    DET    6   det    6   det
6   journée        journée       NC     NC     4   prep   4   obj
7   0              0             PONCT  PONCT  6   ponct  9   ponct
8   Maurice        Maurice       NPP    NPP    9   suj    9   suj
9   donne          donner        V      V      0   root   0   root
10  un             un            DET    DET    11  det    11  det
11  livre          livre         NC     NC     9   obj    9   obj
12  à              à             P      P      9   mod    9   mod
13  son            son           DET    DET    14  det    14  det
14  frère          frère         NC     NC     12  prep   12  obj
15  .              .             PONCT  PONCT  14  ponct  9   ponct
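
The disagreements between the two parsers can also be listed mechanically. The sketch below is a hypothetical helper, not part of Talismane or MaltParser. It assumes each token is one whitespace-separated row with the nine columns used above, where columns 6-7 are Talismane's governor and label and columns 8-9 are Malt's. The sample rows are the "accorde sa confiance à la directrice" sentence from this comparison.

```python
from typing import List, Tuple

# One token per row: ID TOKEN LEMMA TAG_RED TAG_EXT, then governor + label
# from Talismane, then governor + label from Malt (layout assumed from the thread).
ROWS = """\
1 Maurice Maurice NPP NPP 2 suj 2 suj
2 accorde accorder V V 0 root 0 root
3 sa sa DET DET 4 det 4 det
4 confiance confiance NC NC 2 obj 2 obj
5 à à P P 2 mod 4 dep
6 la la DET DET 7 det 7 det
7 directrice directrice NC NC 5 prep 5 obj
8 . . PONCT PONCT 7 ponct 2 ponct
"""

Arc = Tuple[str, str]  # (governor id, dependency label)


def disagreements(rows: str) -> List[Tuple[int, str, Arc, Arc]]:
    """Return (id, token, talismane_arc, malt_arc) for every token whose arcs differ."""
    diffs = []
    for line in rows.strip().splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # skip rows that do not have the expected nine columns
        talismane_arc = (fields[5], fields[6])
        malt_arc = (fields[7], fields[8])
        if talismane_arc != malt_arc:
            diffs.append((int(fields[0]), fields[1], talismane_arc, malt_arc))
    return diffs


for tok_id, form, tal, malt in disagreements(ROWS):
    print(f"{tok_id} {form}: Talismane={tal} Malt={malt}")
```

On this sentence the helper reports tokens 5, 7 and 8: the attachment of "à", the "prep" versus "obj" label, and the governor of the final punctuation.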

urieli commented 6 years ago

You report various issues here - in the future, it would be better to separate them into separate issues.

I'll try to tackle them one at a time, but first of all, a general comment: Talismane and Malt are both based on supervised machine learning, and can only reproduce what they find in their training corpora. Because of copyright issues, we cannot (unfortunately) publish our training corpora. Most of these issues would be far better handled in a discussion of dependency annotation guidelines for French specifically targeting a "gold" standard corpus, as they have little to do with Talismane as software.

Issue 1: "Les poules du couvent couvent." This can be fixed by downloading v5.1.2 and running the command with a higher beam width (option --beamWidth=5). The beam width is a trade-off between parsing speed and parsing accuracy: the higher the beam width, the slower and the more accurate the parse. At beam width = 1 (the default), the parser has to take the first option produced by the pos-tagger, and is highly sensitive to pos-tagger errors. At higher beam widths, the parser selects among various options produced by the pos-tagger.

Note that higher beam widths had a bug in v5.1.1 which was fixed in v5.1.2.
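
To make the beam-width trade-off concrete, here is a minimal Python sketch. It is not Talismane's actual implementation: the tag probabilities are invented for illustration, and `parse_score` is a hypothetical stand-in for the parser that simply prefers tag sequences containing a verb. With a beam of 1, only the tagger's top sequence survives, so the parser never sees the reading where the second "couvent" is a verb. With a wider beam, that reading survives and the parser can pick it.

```python
from typing import Dict, List, Tuple

TagSeq = Tuple[str, ...]

# Invented tagger probabilities for "Les poules du couvent couvent ."
# (assumption for illustration only, not real Talismane output).
TAG_PROBS: List[Dict[str, float]] = [
    {"DET": 1.0},
    {"NC": 1.0},
    {"P+D": 1.0},
    {"NC": 0.9, "V": 0.1},
    {"NC": 0.6, "V": 0.4},
    {"PONCT": 1.0},
]


def parse_score(tags: TagSeq) -> float:
    """Hypothetical parser score: a sentence with a verb is far easier to parse."""
    return 1.0 if "V" in tags else 0.05


def tag_beam(tag_probs: List[Dict[str, float]], beam_width: int) -> List[Tuple[TagSeq, float]]:
    """Keep the beam_width most probable tag sequences after each token."""
    beam: List[Tuple[TagSeq, float]] = [((), 1.0)]
    for options in tag_probs:
        candidates = [
            (tags + (tag,), score * prob)
            for tags, score in beam
            for tag, prob in options.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam


def best_analysis(tag_probs: List[Dict[str, float]], beam_width: int) -> TagSeq:
    """Let the toy parser choose among the tag sequences that survived the beam."""
    beam = tag_beam(tag_probs, beam_width)
    return max(beam, key=lambda c: c[1] * parse_score(c[0]))[0]


print(best_analysis(TAG_PROBS, beam_width=1))
print(best_analysis(TAG_PROBS, beam_width=5))
```

The cost is visible in `tag_beam`: the number of candidates scored per token grows with the beam width, which is why wider beams are slower.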

Issue 2: "Talismane doesn't seem to distinguish indirect objects from PP modifiers." The examples you give show that this is not systematically true. My assumption is that this can only be corrected by feeding Talismane with many more correctly annotated examples during training.

Issue 3: "Talismane attaches the final punctuation to the last token, while Malt attaches the final punctuation to the last verbal root." In the FTB, punctuation is attached haphazardly, and there is nothing in the annotation guide to indicate where it should be attached. Talismane makes the simplifying assumption that punctuation should be systematically attached to the previous non-punctuation token. I currently see no reason to consider this wrong.
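
For anyone who wants to compare the two outputs without the punctuation difference getting in the way, the attachment convention described under Issue 3 is easy to apply as a post-processing step. The following is a minimal sketch under stated assumptions: the token dictionaries and the `attach_punct_to_previous` helper are hypothetical, not Talismane code, and punctuation with no preceding non-punctuation token is left untouched. The sample input is the Malt analysis of "Maurice dort dans son jardin." from the comparison above.

```python
from typing import Dict, List, Optional

Token = Dict[str, object]


def attach_punct_to_previous(tokens: List[Token], punct_tag: str = "PONCT") -> List[Token]:
    """Return a copy where each punctuation token is governed by the
    nearest preceding non-punctuation token."""
    result: List[Token] = []
    last_non_punct: Optional[object] = None
    for tok in tokens:
        tok = dict(tok)  # copy, so the input parse is not modified
        if tok["tag"] == punct_tag:
            if last_non_punct is not None:
                tok["head"] = last_non_punct
                tok["rel"] = "ponct"
        else:
            last_non_punct = tok["id"]
        result.append(tok)
    return result


# Malt's analysis of "Maurice dort dans son jardin ." as reported in this thread.
malt_parse: List[Token] = [
    {"id": 1, "form": "Maurice", "tag": "NPP", "head": 2, "rel": "suj"},
    {"id": 2, "form": "dort", "tag": "V", "head": 0, "rel": "root"},
    {"id": 3, "form": "dans", "tag": "P", "head": 2, "rel": "mod"},
    {"id": 4, "form": "son", "tag": "DET", "head": 5, "rel": "det"},
    {"id": 5, "form": "jardin", "tag": "NC", "head": 3, "rel": "obj"},
    {"id": 6, "form": ".", "tag": "PONCT", "head": 2, "rel": "ponct"},
]

normalised = attach_punct_to_previous(malt_parse)
print(normalised[-1])
```

After this step the final period is governed by token 5 ("jardin"), as in the Talismane output, so any remaining differences concern the other dependencies.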

Issue 4: "Talismane seems to systematically attach the NP to the last Prep, tagging its function as 'prep'." See the dependency labels file: https://github.com/joliciel-informatique/talismane/blob/master/talismane_core/languagePacks/french/languagePack/talismaneDependencyLabels_fr.txt#L41

Issue 5: "Talismane, as well as Malt, does not distinguish transitive verbs from intransitive ones." I agree with your analysis, and can only assume additional training examples would correct this.