languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[en] false alarms EN_COMPOUNDS #1947

Closed (tiff closed this issue 5 years ago)

tiff commented 5 years ago

EN_COMPOUNDS produces some false positives because of newly added words, namely:

https://internal1.languagetool.org/regression-tests/20190916/result_en_20190916.html

The words were copied from AtD, but I don't see the same false alarms happening there.

cc @danielnaber @TiagoSantos81

tiff commented 5 years ago

I also have question marks regarding:

tiff commented 5 years ago

I did some quick research: it turns out that for almost every one of the added compounds we already have a specific rule in grammar.xml.

See: ALONG_SIDE, ANY_WHERE, BE_CAUSE, BEFORE_HAND, BE_WARE, LIGHT_WEIGHT, WEB_SITE, NEAR_BY, SHORT_CUT, FREE_LANCER ...

I think the words were just blindly copied and pasted without checking whether the cases are already caught.

TiagoSantos81 commented 5 years ago

> I think the words were just blindly copied and pasted without checking whether the cases are already caught.

The list was converted from atd/data/rules/prepositions.txt, which contains rules in the simplified AtD format. I did not check whether they were already in LT's grammar.xml, since simple compound rules should be in the compounds file. It seems I duplicated effort. The entries you removed from the English compounds have equivalents in grammar.xml similar to the rules in the next post. Even though those use the slightly harder-to-read XML syntax, they are equally simple, and they clearly provide a less adequate message.

To avoid further duplication of work, why not just comment out entries instead of deleting them? That is also more accountable and easier to revert, although it may defeat the purpose of the removals.
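A check for this kind of overlap could be automated before new entries are added. The following is only a minimal sketch, not part of LanguageTool: the function names are hypothetical, it assumes compound entries are hyphen- or space-separated words with an optional trailing `+` or `*` and `#` comments, and it only looks at rules whose pattern starts with at least two literal tokens. The real grammar.xml may need extra handling (for example its DOCTYPE and entities) before `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET


def compound_phrases(compounds_text):
    """Normalize compound entries to lowercase, space-separated phrases."""
    phrases = set()
    for line in compounds_text.splitlines():
        entry = line.split("#", 1)[0].strip()  # ignore comments and commented-out entries
        if not entry:
            continue
        entry = entry.rstrip("+*")  # assumed trailing markers
        phrases.add(" ".join(entry.replace("-", " ").lower().split()))
    return phrases


def rule_phrases(grammar_xml):
    """Map rule id -> phrase built from the leading literal tokens of its pattern."""
    phrases = {}
    for rule in ET.fromstring(grammar_xml).iter("rule"):
        pattern = rule.find("pattern")
        if pattern is None:
            continue
        words = []
        for token in pattern.iter("token"):
            text = (token.text or "").strip()
            # Stop at the first token that is not a plain literal word.
            if token.get("regexp") == "yes" or len(token) > 0 or not text:
                break
            words.append(text.lower())
        if len(words) >= 2:
            phrases[rule.get("id")] = " ".join(words)
    return phrases


def find_duplicates(compounds_text, grammar_xml):
    """Return {rule id: phrase} for rules that a compound entry also covers."""
    compounds = compound_phrases(compounds_text)
    return {rid: p for rid, p in rule_phrases(grammar_xml).items() if p in compounds}


if __name__ == "__main__":
    sample_compounds = "back-fire+\n# worth-while+\nnear-by+\n"
    sample_grammar = """<rules>
      <rule id="BACK_FIRE" name="back fire (backfire)">
        <pattern><token>back</token><token>fire</token></pattern>
      </rule>
      <rule id="WORTH_WHILE" name="worth while (worthwhile)">
        <pattern><token>worth</token><token>while</token></pattern>
      </rule>
    </rules>"""
    print(find_duplicates(sample_compounds, sample_grammar))
```

Commented-out entries are skipped on purpose, so an entry that was disabled rather than deleted would not be reported as a duplicate.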

TiagoSantos81 commented 5 years ago

With the exception of WORLD_WIDE, the remaining rules are better placed in compounds.txt. # world-wide+ would do the trick there too.

        <!-- back fire::word=backfire -->
        <rule id="BACK_FIRE" name="back fire (backfire)">
            <pattern>
                <token>back</token>
                <token>fire</token>
            </pattern>
            <message>Did you mean <suggestion>backfire</suggestion>?</message>
            <example correction="backfire">His plans always <marker>back fire</marker>.</example>
        </rule>
        <!-- world wide::word=worldwide -->
        <rule id="WORLD_WIDE" name="world wide (worldwide)">
            <antipattern case_sensitive="yes">
                <token>World</token>
                <token>Wide</token>
                <token>Fund</token>
                <token regexp="yes">For|for</token><!-- Lower case 'f' is a style decision -->
                <token>Nature</token>
            </antipattern>
            <pattern>
                <marker>
                    <token>world</token>
                    <token>wide</token>
                </marker>
                <token><exception>web</exception></token> <!-- Don't suggest change for "World Wide Web" -->
            </pattern>
            <message>Did you mean <suggestion>worldwide</suggestion>?</message>
            <example correction="worldwide">There was a <marker>world wide</marker> epidemic.</example>
            <example>There was a <marker>world-wide</marker> epidemic.</example>
            <example>The '<marker>World Wide Fund For Nature</marker>' is also known as the 'World Wildlife Fund' (https://wwf.panda.org/who_we_are/wwf_offices/uk/).</example>
        </rule>
        <!-- worth while::word=worthwhile -->
        <rule id="WORTH_WHILE" name="worth while (worthwhile)">
            <pattern>
                <token>worth</token>
                <token>while</token>
            </pattern>
            <message>Did you mean <suggestion>worthwhile</suggestion>?</message>
            <example correction="worthwhile">It was a <marker>worth while</marker> endeavor.</example>
        </rule>