languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[fr] false alarms in French CONFUSION_RULE #3058

Open jaumeortola opened 4 years ago

jaumeortola commented 4 years ago

There are many false alarms (after the tokenizer update). The most prominent pair is il/ils; others include sait/sais, vert/vers, and mai/mais. See: https://internal1.languagetool.org/regression-tests//20200610/result_fr_20200610_table.html and https://internal1.languagetool.org/regression-tests/via-http/2020-06-11/fr/result_java_CONFUSION_RULE.html

jaumeortola commented 4 years ago

Can these pairs be re-enabled? I don't know whether the tokenizer change has had any effect on the CONFUSION_RULE. They were disabled here: https://github.com/languagetool-org/languagetool/commit/1be44fceb78a86a7976792a38ef844e860654f68

danielnaber commented 4 years ago

These pairs don't seem to contain special characters, so I don't think the tokenizer affects them. In other words, if they caused many false alarms before the tokenizer update, I don't expect the update to fix that.

jaumeortola commented 4 years ago

"il" and "ils" (the most frequent false alarms) appeared in "qu'il" and "qu'ils", which have been affected by the change in the tokenizer. The other words have not been affected. If CONFUSION_RULE uses the tokenizer (which I guess it does, at least for the analyzed word), then "il/ils" can be re-evaluated. But perhaps we can deal with "il/ils" just with agreement rules.

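To illustrate why the tokenizer matters for "il/ils" but not for the other pairs, here is a minimal sketch. It is not LanguageTool's tokenizer or rule code: both tokenizer functions, `CONFUSION_PAIR`, and `candidates` are hypothetical names made up for this example. It only shows that whether "qu'il" is kept whole or split decides whether "il" is ever seen as a candidate token.

```python
import re

# Hypothetical confusion pair from this thread; not LanguageTool's data format.
CONFUSION_PAIR = {"il", "ils"}


def tokenize_whitespace(text):
    """Split on whitespace only, so "qu'il" stays a single token."""
    return text.split()


def tokenize_split_elision(text):
    """Split after an apostrophe, so "qu'il" becomes "qu'" and "il"."""
    return re.findall(r"\w+['’]|\w+", text)


def candidates(tokens):
    """Return the tokens that a confusion check would look at."""
    return [t for t in tokens if t.lower() in CONFUSION_PAIR]


if __name__ == "__main__":
    sentence = "Je crois qu'il vient"
    print(candidates(tokenize_whitespace(sentence)))
    print(candidates(tokenize_split_elision(sentence)))
```

With the whitespace tokenizer no token equals "il", so nothing is checked; with the elision-splitting tokenizer "il" becomes a candidate. Words like "vert" or "mais" come out the same either way, which is why only "il/ils" would be worth re-evaluating.
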
danielnaber commented 4 years ago

I just ran the re-evaluation, but there's no change, probably because we use a tokenization here that matches that of the ngram data from Google Books.
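
A rough sketch of why the lookup has to use the ngram data's own tokenization, and so why a change to the main tokenizer would not show up in the re-evaluation. Everything here is an assumption for illustration: `ngram_tokenize`, `KNOWN_TRIGRAMS`, and `known_context` are made-up names, the one-entry table is a toy, and I am only assuming for the example that the ngram data splits "qu'il" into "qu'" and "il".

```python
import re

# Toy stand-in for the ngram index (invented for this example, not real data).
KNOWN_TRIGRAMS = {("qu'", "il", "vient")}


def ngram_tokenize(text):
    """Assumed ngram-style tokenization: split after an apostrophe."""
    return re.findall(r"\w+['’]|\w+", text)


def trigrams(tokens):
    """All consecutive 3-token windows."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def known_context(tokens):
    """Trigrams of the input that can be found in the toy index."""
    return [g for g in trigrams(tokens) if g in KNOWN_TRIGRAMS]


if __name__ == "__main__":
    sentence = "Je crois qu'il vient"
    # Tokenized like the index: the context is found.
    print(known_context(ngram_tokenize(sentence)))
    # Tokenized differently: the same text produces keys the index lacks.
    print(known_context(sentence.split()))
```

If the rule always tokenizes the way the index was built, its lookups are independent of whatever the main tokenizer does.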