languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1
12.01k stars 1.38k forks source link

antipattern bug with skip #189

Closed danielnaber closed 9 years ago

danielnaber commented 9 years ago

I think there's a bug with antipattern: If it's used with skip, the immunization doesn't reach to the end of the pattern. For example, this rule matches This is a simple sentence. even though the antipattern should prevent that. With the -v option you can see that sentence doesn't get immunized.

<rule id="BUGGY" name="name">
    <antipattern>
        <token skip="-1">a</token>
        <token>sentence</token>
    </antipattern>
    <pattern>
        <token>sentence</token>
    </pattern>
    <message>my message</message>
    <example type='incorrect'>This <marker>sentence</marker>.</example>
    <example type='correct'>This is a simple <marker>sentence</marker>.</example>
</rule>
jaumeortola commented 9 years ago

Hi,

The antipattern needs always the to be functional. We talked about this in the list.

a sentence sentence my message This sentence. This is a simple sentence.

This should be changed. Either the marker is required in the xml syntax or, if not written, it is unterstood implicitly from start to end.

Regards, Jaume

2014-09-16 20:38 GMT+02:00 Daniel Naber notifications@github.com:

I think there's a bug with antipattern: If it's used with skip, the immunization doesn't reach to the end of the pattern. For example, this rule matches This is a simple sentence. even though the antipattern should prevent that. With the -v option you can see that sentence doesn't get immunized.

a sentence sentence my message This sentence. This is a simple sentence.

— Reply to this email directly or view it on GitHub https://github.com/languagetool-org/languagetool/issues/189.

danielnaber commented 9 years ago

I think it should work just like for pattern: if there's no marker, the marker is implicit and covers the complete (anti)pattern.

dpelle commented 9 years ago

Daniel Naber wrote:

I think it should work just like for pattern: if there's no marker, the marker is implicit and covers the complete (anti)pattern.

That is what I expected too. But now I just read the Marcin's reply to a similar question here and I'm confused:

https://www.mail-archive.com/languagetool-devel@lists.sourceforge.net/msg03397.html

Wow, I overlooked that completely (and I'm not the only one it seems). I have plenty of rules in French for example, where I do not use at all, so does it mean they are wrong? Example of French rule:

  <rule>
    <antipattern>
      <token>Jimi</token>
      <token>Hendrix</token>
    </antipattern>
    <pattern>
      <token regexp="yes">Jimm?[yi]</token>
      <token regexp="yes">Hendrix|Hendric?ks</token>
    </pattern>
    <message>Écrivez <suggestion>Jimi Hendrix</suggestion> s’il s’agit du guitariste.</message>
    <url>https://fr.wikipedia.org/wiki/Jimi_Hendrix</url>
    <example type="incorrect" correction="Jimi Hendrix"><marker>Jimi Hendricks</marker> est né à Seattle.</example>
    <example type="incorrect" correction="Jimi Hendrix"><marker>Jimmy Hendrix</marker> est né à Seattle.</example>
    <example type="correct">Jimi Hendrix est né à Seattle.</example>
  </rule>

It seems to work. All the following sentences give errors for example:

The correct "Jimi Hendrix" does not give any errors as expected thanks to the antipattern.

So I do not understand the need for in

rules.
milekpl commented 9 years ago

W dniu 2014-09-16 o 21:30, Dominique Pellé pisze:

Daniel Naber wrote:

I think it should work just like for pattern:
if there's no marker, the marker is implicit
and covers the complete (anti)pattern.

That is what I expected too. But now I just read the Marcin's reply to a similar question here and I'm confused:

https://www.mail-archive.com/languagetool-devel@lists.sourceforge.net/msg03397.html

Wow, I overlooked that completely (and I'm not the only one it seems). I have plenty of rules in French for example, where I do not use at all, so does it mean they are wrong? Example of French rule:

|

Jimi Hendrix
 <pattern>
   <token regexp="yes">Jimm?[yi]</token>
   <token regexp="yes">Hendrix|Hendric?ks</token>
 </pattern>
 <message>Écrivez <suggestion>Jimi Hendrix</suggestion> s’il s’agit du guitariste.</message>
 <url>https://fr.wikipedia.org/wiki/Jimi_Hendrix</url>
 <example type="incorrect" correction="Jimi Hendrix"><marker>Jimi Hendricks</marker> est né à Seattle.</example>
 <example type="incorrect" correction="Jimi Hendrix"><marker>Jimmy Hendrix</marker> est né à Seattle.</example>
 <example type="correct">Jimi Hendrix est né à Seattle.</example>

|

It seems to work. All the following sentences give errors for example:

  • Jimy Hendrix
  • Jimi Hendricks
  • Jimmy Hendrix
  • Jimmi Hendrix

The correct "Jimi Hendrix" does not give any errors as expected thanks to the antipattern.

So I do not understand the need for in rules.

There is a bug, and the bug is simply worked around if you use the marker. I don't know how to fix it, and I assure that I've spent several days looking at it. Probably this needs a fresh eye.

Regards, Marcin

danielnaber commented 9 years ago

Also see http://languagetool-user-forum.2306527.n4.nabble.com/What-is-the-difference-between-these-two-antipatterns-td4641883.html#a4641891 for another example that might be caused by the same bug (didn't look at that yet).

milekpl commented 9 years ago

The problem is in: org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRuleReplacer#executeAction

lines 168-211

Really, I have no idea how to fix these bugs.

danielnaber commented 9 years ago

I think I have a fix/workaround that just uses this as the lastMatchToken argument to executeAction:

int lastMatchTokenTmp = lastMarkerMatchToken != -1 ? lastMarkerMatchToken : lastMatchToken;

Fixes the bug described above and passes all tests. Should I commit this?

milekpl commented 9 years ago

If this also fixes the other examples from the forum, I'd go for the fix. Looks sensible to me. If anything breaks, we'll see changes in the wiki/tatoeba nightly diff.

danielnaber commented 9 years ago

This (probably) caused regression in nightly check:

25,000 sentences checked...
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
        at java.util.ArrayList.rangeCheck(ArrayList.java:604)
        at java.util.ArrayList.get(ArrayList.java:382)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRuleReplacer.executeAction(DisambiguationPatternRuleReplacer.java:304)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRuleReplacer.replace(DisambiguationPatternRuleReplacer.java:134)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationPatternRule.replace(DisambiguationPatternRule.java:102)
        at org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator.disambiguate(XmlRuleDisambiguator.java:65)
        at org.languagetool.tagging.disambiguation.pl.PolishHybridDisambiguator.disambiguate(PolishHybridDisambiguator.java:47)
        at org.languagetool.JLanguageTool.getAnalyzedSentence(JLanguageTool.java:720)
        at org.languagetool.JLanguageTool.analyzeSentences(JLanguageTool.java:546)
        at org.languagetool.JLanguageTool.check(JLanguageTool.java:522)

Caused by Polish input winny obowiązany jest.

danielnaber commented 9 years ago

I'm reverting my fix for now because of the IndexOutOfBoundsException (see above).

danielnaber commented 9 years ago

This should be fixed / worked around now.