languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

False positives on bibcodes #2712

Open bperel opened 4 years ago

bperel commented 4 years ago

Hello,

Perhaps an edge case, but bibcodes, which are used to cite literature about astronomical data, are flagged as errors because they can contain consecutive dots, for instance 2002AJ....123..549P.

I found a regular expression that matches these bibcodes:

\d{4}[A-Za-z\.\&]{5}[\w\.]{4}[ELPQ-Z\.][\d\.]{4}[A-Z]
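
For illustration, here is a minimal Java sketch that checks the expression above against the example bibcode. The class and method names are hypothetical and are not part of LanguageTool; only `java.util.regex` from the standard library is used.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of LanguageTool.
final class BibcodeRegexDemo {

    // The regular expression quoted above, written as a Java string literal.
    static final Pattern BIBCODE = Pattern.compile(
        "\\d{4}[A-Za-z\\.\\&]{5}[\\w\\.]{4}[ELPQ-Z\\.][\\d\\.]{4}[A-Z]");

    // True only if the whole token has the shape of a bibcode.
    static boolean isBibcode(String token) {
        return BIBCODE.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBibcode("2002AJ....123..549P")); // true
        System.out.println(isBibcode("Wait.. what"));         // false
    }
}
```

Note that the expression only checks the overall shape (19 characters, with a four-digit year first and an uppercase letter last); it does not verify that the journal abbreviation or page number exists.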

bperel commented 4 years ago

I'm willing to create the rule myself, but even after reading the Development overview page I'm not sure whether this counts as a "rule": I just don't want the "2 consecutive dots" error to occur. Does it mean that the method org.languagetool.rules.DoublePunctuationRule#match should be altered?

dpelle commented 4 years ago

@bperel wrote:

Does it mean that the method org.languagetool.rules.DoublePunctuationRule#match should be altered?

You probably don't need to change any Java code to add a rule. In general, adding a rule to grammar.xml is enough.

The audience for bibcodes seems too specialized for this to be added to LanguageTool, I suspect. Well, it could be added on the condition that it almost never causes false alarms. Or do you want to add a rule just for your own build of LT?

You found a regexp to match valid bibcodes, but keep in mind that LT detects errors by specifying erroneous patterns. So what would qualify as a pattern for a probable bibcode that is not a valid bibcode?

bperel commented 4 years ago

I'm not sure the audience is too specialized, actually: I ran LanguageTool on 500 Wikipedia articles, and 10% of the errors it found were "Two consecutive dots" caused by bibcodes. I suspect that these codes appear mostly in the "References" section of articles, though.

I wouldn't mind integrating the rule into upstream LanguageTool if it's useful to people, but as you say, the issue is that the regex I have describes valid matches, not errors. I don't think a bibcode-checker rule would be useful (now that would be specialized!), which is why I suggested integrating it into the "two consecutive dots" rule: basically, if the double dots are surrounded by characters forming a bibcode, then don't throw an error. If you believe that it's too specialized, then I will add it to my fork only.
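
To make the idea concrete, here is a rough standalone sketch in Java of the filtering logic. It is an assumption-laden illustration: it does not use LanguageTool's rule API or reproduce how DoublePunctuationRule actually works, and the class and method names are hypothetical. It reports runs of two or more dots, except those that fall entirely inside a substring matching the bibcode expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it does not use or reproduce LanguageTool's rule API.
final class DoubleDotFilter {

    // Bibcode shape, same expression as in the first comment.
    static final Pattern BIBCODE = Pattern.compile(
        "\\d{4}[A-Za-z\\.\\&]{5}[\\w\\.]{4}[ELPQ-Z\\.][\\d\\.]{4}[A-Z]");

    // A run of two or more consecutive dots.
    static final Pattern DOT_RUN = Pattern.compile("\\.{2,}");

    /** Returns the start offsets of dot runs that lie outside every bibcode match. */
    static List<Integer> doubleDotOffsets(String text) {
        // Collect the [start, end) spans of all bibcode-shaped substrings.
        List<int[]> spans = new ArrayList<>();
        Matcher bib = BIBCODE.matcher(text);
        while (bib.find()) {
            spans.add(new int[] {bib.start(), bib.end()});
        }

        // Keep only the dot runs that are not contained in one of those spans.
        List<Integer> offsets = new ArrayList<>();
        Matcher dots = DOT_RUN.matcher(text);
        while (dots.find()) {
            boolean insideBibcode = false;
            for (int[] span : spans) {
                if (dots.start() >= span[0] && dots.end() <= span[1]) {
                    insideBibcode = true;
                    break;
                }
            }
            if (!insideBibcode) {
                offsets.add(dots.start());
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Only the ".." after "Wait" is reported; the dots in the bibcode are skipped.
        System.out.println(doubleDotOffsets("Wait.. see 2002AJ....123..549P")); // [4]
    }
}
```

Because the bibcode pattern is applied unanchored here, any 19-character substring with the right shape suppresses the warning, so a real implementation would probably want to require token boundaries around the match.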