I also have question marks regarding:
I did some quick research: it turns out that for almost every one of the added compounds we already have specific rules in grammar.xml.
See: ALONG_SIDE, ANY_WHERE, BE_CAUSE, BEFORE_HAND, BE_WARE, LIGHT_WEIGHT, WEB_SITE, NEAR_BY, SHORT_CUT, FREE_LANCER ...
I think the words were just blindly copied and pasted without checking whether the cases were already caught.
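A check like this could be scripted before adding entries. Below is a minimal sketch, not an existing LT tool: it assumes compounds.txt entries look like `back-fire+` (hyphen-separated parts, optional trailing `+` or `*`, `#` for comments) and that the relevant grammar.xml rules use plain literal tokens. The function names and the `sea-horse+` entry are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real grammar.xml is much larger.
SAMPLE_RULES = """
<rule id="BACK_FIRE" name="back fire (backfire)">
  <pattern><token>back</token><token>fire</token></pattern>
</rule>
<rule id="WORTH_WHILE" name="worth while (worthwhile)">
  <pattern><token>worth</token><token>while</token></pattern>
</rule>
"""


def rule_token_sequences(xml_text):
    """Map each rule id to the lowercase literal tokens of its <pattern>."""
    # Wrap the fragment so it parses as a single XML document.
    root = ET.fromstring(f"<rules>{xml_text}</rules>")
    sequences = {}
    for rule in root.iter("rule"):
        pattern = rule.find("pattern")
        if pattern is None:
            continue
        tokens = []
        for tok in pattern.iter("token"):
            if tok.get("regexp") == "yes":
                continue  # regex tokens are not literal words
            text = (tok.text or "").strip().lower()
            if text:  # skips tokens that only hold an <exception>
                tokens.append(text)
        sequences[rule.get("id")] = tuple(tokens)
    return sequences


def compound_parts(line):
    """Split an entry such as 'world-wide+' into its parts; () for comments."""
    entry = line.split("#", 1)[0].strip().rstrip("+*")
    return tuple(entry.lower().split("-")) if entry else ()


def already_covered(compound_lines, xml_text):
    """Return {compound: rule id} for entries that a simple rule already matches."""
    by_tokens = {seq: rid for rid, seq in rule_token_sequences(xml_text).items()}
    hits = {}
    for line in compound_lines:
        parts = compound_parts(line)
        if parts and parts in by_tokens:
            hits["-".join(parts)] = by_tokens[parts]
    return hits


if __name__ == "__main__":
    print(already_covered(["back-fire+", "# world-wide+", "sea-horse+"], SAMPLE_RULES))
```

This only catches rules whose pattern is exactly the two literal words; rules with regex tokens, exceptions or antipatterns would still need a manual look.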
The list was converted from atd/data/rules/prepositions.txt,
which contains rules in the simplified AtD format. I did not check whether they were already in LT's grammar.xml,
since simple compound rules should be in the file. It seems I duplicated effort.
The same entries you removed from the English compounds have counterparts in grammar.xml
similar to the one in the next post. Even if they use the slightly harder-to-read XML syntax, they are equally simple, and it is obvious that they provide a less adequate message.
To avoid further duplication of work, why not just comment out entries instead of deleting them? It is also more accountable and easier to revert, although that may defeat the purpose behind the removals.
With the exception of world_wide, the remaining entries are better placed in compounds.txt.
# world-wide+
would do the trick there too.
<!-- back fire::word=backfire -->
<rule id="BACK_FIRE" name="back fire (backfire)">
    <pattern>
        <token>back</token>
        <token>fire</token>
    </pattern>
    <message>Did you mean <suggestion>backfire</suggestion>?</message>
    <example correction="backfire">His plans always <marker>back fire</marker>.</example>
</rule>
<!-- world wide::word=worldwide -->
<rule id="WORLD_WIDE" name="world wide (worldwide)">
    <antipattern case_sensitive="yes">
        <token>World</token>
        <token>Wide</token>
        <token>Fund</token>
        <token regexp="yes">For|for</token><!-- Lower case 'f' is a style decision -->
        <token>Nature</token>
    </antipattern>
    <pattern>
        <marker>
            <token>world</token>
            <token>wide</token>
        </marker>
        <token><exception>web</exception></token> <!-- Don't suggest change for "World Wide Web" -->
    </pattern>
    <message>Did you mean <suggestion>worldwide</suggestion>?</message>
    <example correction="worldwide">There was a <marker>world wide</marker> epidemic.</example>
    <example>There was a <marker>world-wide</marker> epidemic.</example>
    <example>The '<marker>World Wide Fund For Nature</marker>' is also known as the 'World Wildlife Fund' (https://wwf.panda.org/who_we_are/wwf_offices/uk/).</example>
</rule>
<!-- worth while::word=worthwhile -->
<rule id="WORTH_WHILE" name="worth while (worthwhile)">
    <pattern>
        <token>worth</token>
        <token>while</token>
    </pattern>
    <message>Did you mean <suggestion>worthwhile</suggestion>?</message>
    <example correction="worthwhile">It was a <marker>worth while</marker> endeavor.</example>
</rule>
EN_COMPOUNDS produces some false positives, because of newly added words, namely:
https://internal1.languagetool.org/regression-tests/20190916/result_en_20190916.html
The words were copied from AtD, but I don't see the same false alarms happening there.
cc @danielnaber @TiagoSantos81