languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Different results: command-line vs. XML tests #6300

Open jaumeortola opened 2 years ago

jaumeortola commented 2 years ago

After these changes to PatternRule.java (https://github.com/languagetool-org/languagetool/commit/d5868d9a73ab8a7d19b7f75e4a5b72fe0ecf76f0), some antipatterns no longer work as expected, and we get these false positives in French:

FRENCH_WORD_REPEAT_RULE[2]

Fête des mères et remise de l'insigne " Morts pour la France ".

ACCORD_V_QUESTION2[1]

D'autre part je ne soutiens pas du tout le système actuel en france mais je sais qu'au train où l'on va que notre prochaine étape sera celle de la Grèce ou de l'Argentine.

But these matches appear only on the command line. The same sentences don't produce any match in the XML example tests. So immunization with antipatterns produces different results in the XML tests and on the command line.

It would be interesting to know the cause of this difference. It could help to discover the origin of other unexpected results like these: https://github.com/languagetool-org/languagetool/issues/5252.

@arysin Do you have any clue?

arysin commented 2 years ago

I have local changes in the core to catch cases where inflection in the rule suggestions doesn't work correctly, but to prevent many false positives I also had to make antipatterns run before the regular patterns. So the main change in PatternRule is to run antipatterns before regular ones. Technically, both changes should be applicable to all languages (and all rules pass), but the change was too big to push, so I kept it local (the other, one-line change was in MatchState). Apologies for accidentally pushing the change. I am not sure why the results would differ between XML and the command line, though.
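To illustrate why the order described above matters, here is a minimal toy model of immunization. It is only a sketch: the `Token` class, the `immunize` and `findRepeats` methods, and the input sentence are all hypothetical and are not LanguageTool's actual classes or rules. The antipattern step flags tokens, and the regular pattern skips flagged tokens, so the suppression only works if the antipattern runs first and its flags survive until the regular pattern is evaluated.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT LanguageTool classes.
class ImmunizationSketch {

    /** A token with a flag that an antipattern can set to suppress matches. */
    static final class Token {
        final String text;
        boolean immunized;
        Token(String text) { this.text = text; }
    }

    /** Split a sentence on whitespace into fresh, non-immunized tokens. */
    static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        for (String part : sentence.trim().split("\\s+")) {
            tokens.add(new Token(part));
        }
        return tokens;
    }

    /** Antipattern step: flag every token covered by an occurrence of the phrase. */
    static void immunize(List<Token> tokens, String... phrase) {
        for (int i = 0; i + phrase.length <= tokens.size(); i++) {
            boolean hit = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens.get(i + j).text.equalsIgnoreCase(phrase[j])) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                for (int j = 0; j < phrase.length; j++) {
                    tokens.get(i + j).immunized = true;
                }
            }
        }
    }

    /** Regular pattern: index of each immediately repeated word, skipping immunized tokens. */
    static List<Integer> findRepeats(List<Token> tokens) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            Token prev = tokens.get(i - 1);
            Token cur = tokens.get(i);
            boolean repeated = prev.text.equalsIgnoreCase(cur.text);
            if (repeated && !prev.immunized && !cur.immunized) {
                matches.add(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up input: "nous nous" is treated as a legitimate repetition.
        List<Token> tokens = tokenize("nous nous sommes vus vus");
        immunize(tokens, "nous", "nous");        // antipattern runs first
        System.out.println(findRepeats(tokens)); // only the unprotected repeat remains
    }
}
```

If the tokens that the regular pattern inspects are not the same objects (or have lost the flag) by the time it runs, the model reports both repeats, which is the kind of false positive discussed in this issue.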

arysin commented 2 years ago

My only guess is that the XML rule tests can't cover all possible scenarios. I run regression tests via the command line on a huge corpus for Ukrainian, and I often catch things that the XML tests didn't (sometimes I add XML rule tests based on that). I don't know French, but I wonder whether the cases from the command line are simply not covered.

jaumeortola commented 2 years ago

With arysin's code, there is a strange bug. It shows up with this rule and this sentence:

ACCORD_V_QUESTION2[1]

D'autre part je ne soutiens pas du tout le système actuel en france mais je sais qu'au train où l'on va que notre prochaine étape sera celle de la Grèce ou de l'Argentine.

After matcher.call, the analyzed sentence is corrupted.

je[je/R pers suj 1 s] ne[ne/A] soutiens[soutenir/V ind pres 1 s, soutenir/V ind pres 2 s]

becomes

soutiens[je/R pers suj 1 s, soutenir/V ind pres 2 s] ne[ne/A] soutiens[soutenir/V ind pres 1 s, soutenir/V ind pres 2 s]
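One way to pin down where a change like this happens is to snapshot the sentence as strings before the suspect call and compare afterwards. The sketch below is a toy, not a diagnosis of the actual bug: `Token`, `snapshot`, and `changedPositions` are hypothetical names, not LanguageTool's API, and the corruption is reproduced by hand as a stand-in for whatever matcher.call does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical types, not LanguageTool's AnalyzedSentence API.
class SentenceSnapshotCheck {

    /** A token with a surface form and a mutable list of "lemma/POS" readings. */
    static final class Token {
        String surface;
        final List<String> readings;
        Token(String surface, String... readings) {
            this.surface = surface;
            this.readings = new ArrayList<>(Arrays.asList(readings));
        }
        @Override
        public String toString() {
            return surface + readings;
        }
    }

    /** Render every token to a string so that later mutation cannot alter the snapshot. */
    static List<String> snapshot(List<Token> tokens) {
        List<String> copy = new ArrayList<>();
        for (Token t : tokens) {
            copy.add(t.toString());
        }
        return copy;
    }

    /** Indexes whose rendering differs between two snapshots of equal length. */
    static List<Integer> changedPositions(List<String> before, List<String> after) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < before.size(); i++) {
            if (!before.get(i).equals(after.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("je", "je/R pers suj 1 s"));
        tokens.add(new Token("ne", "ne/A"));
        tokens.add(new Token("soutiens", "soutenir/V ind pres 1 s", "soutenir/V ind pres 2 s"));

        List<String> before = snapshot(tokens);

        // Stand-in for the suspect call: mimic the reported corruption by hand.
        Token first = tokens.get(0);
        first.surface = "soutiens";
        first.readings.add("soutenir/V ind pres 2 s");

        List<String> after = snapshot(tokens);
        for (int i : changedPositions(before, after)) {
            System.out.println(before.get(i) + " -> " + after.get(i));
        }
    }
}
```

Placing such a check around successive calls narrows down which one mutates the shared token objects instead of working on a copy.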

arysin commented 2 years ago

@jaumeortola Can you please show which disambiguation rules are applied to this sentence? (I think "-v" should do it.)

jaumeortola commented 2 years ago

There is nothing wrong in the disambiguation rules. What is corrupted is the immunized sentence, and this prevents the antipatterns from working. The immunized sentence can be seen only when debugging.

Expected text language: French
Working on STDIN...
<S> D'[de/P,containsTypewriterApostrophe]autre[autre/J e s,autre/R e s,autre/_GN_MS,autre/_GN_FS] part[part/N e s,partir/V ind pres 3 s,part/_GN_MS,part/_GN_FS] je[je/R pers suj 1 s] ne[ne/A] soutiens[soutenir/V ind pres 1 s,soutenir/V ind pres 2 s] pas[pas/A] du[du/A] tout[tout/null] le[le/D m s,le/_GN_MS,le/_GN_MS] système[système/N m s,système/_GN_MS,système/_GN_MS] actuel[actuel/J m s,actuel/N m s,actuel/_GN_MS] en[en/P] france[france/null] mais[mais/C coor] je[je/R pers suj 1 s] sais[savoir/V ind pres 1 s] qu'[que/C sub,containsTypewriterApostrophe]au[à+le/P+D m s] train[train/N m s] où[où/A inte,où/R rel e sp] l'[le/R pers obj 3 e s,containsTypewriterApostrophe]on[on/R pers suj 3 e s] va[aller/V ind pres 3 s] que[que/C sub] notre[notre/D e s,notre/_GN_FS] prochaine[prochain/J f s,prochaine/_GN_FS,prochaine/_GN_FS] étape[étape/N f s,étape/_GN_FS,étape/_GN_FS] sera[être/V etre ind futu 3 s] celle[celle/R dem f s] de[de/P] la[le/D f s,la/_GN_FS] Grèce[Grèce/N f sp,Grèce/_GN_FS] ou[ou/C coor] de[de/P] l'[le/D e s,l'/_GN_FS,containsTypewriterApostrophe]Argentine[Argentine/N f sp,argentin/J f s,argentin/N f s,Argentine/_GN_FS].[./M fin,</S>]<P/> 
Disambiguator log: 

PREPOSITIONS[2]: D'[de/D e sp*,de/P*,containsTypewriterApostrophe] -> D'[de/P*,containsTypewriterApostrophe]

NOMINAL_GROUPS[25]: autre[autre/J e s*,autre/R e s*] -> autre[autre/J e s*,autre/R e s*,autre/_GN_MS*]
NOMINAL_GROUPS[26]: autre[autre/J e s*,autre/R e s*,autre/_GN_MS*] -> autre[autre/J e s*,autre/R e s*,autre/_GN_MS*,autre/_GN_FS*]

NOMINAL_GROUPS[25]: part[part/N e s,partir/V ind pres 3 s] -> part[part/N e s,partir/V ind pres 3 s,part/_GN_MS]
NOMINAL_GROUPS[26]: part[part/N e s,partir/V ind pres 3 s,part/_GN_MS] -> part[part/N e s,partir/V ind pres 3 s,part/_GN_MS,part/_GN_FS]

RP-NEGATION[2]: ne[ne/A] -> ne[ne/A]
RB-ADVERBES[1]: ne[ne/A] -> ne[ne/A]

NE_V[1]: soutiens[soutien/N m p,soutenir/V imp pres 2 s,soutenir/V ind pres 1 s,soutenir/V ind pres 2 s] -> soutiens[soutenir/V imp pres 2 s,soutenir/V ind pres 1 s,soutenir/V ind pres 2 s]
PRONOM_SUJET_VERB[1]: soutiens[soutenir/V imp pres 2 s,soutenir/V ind pres 1 s,soutenir/V ind pres 2 s] -> soutiens[soutenir/V ind pres 1 s,soutenir/V ind pres 2 s]

The problem arises just before the sentence is immunized (after matcher.call). The image shows the difference before and after immunizing:

[image]

arysin commented 2 years ago

I just wanted to apologize - I was going to look deeper into this but with the war it's really hard to find time and the right state of mind.