languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Allow unification in antipatterns #6245

Open yakovru opened 2 years ago

yakovru commented 2 years ago

Please allow the use of unification in antipatterns. Currently, we can put <unification></unification> tags inside <antipattern></antipattern>, but they have no effect, and the rules test reports no errors.

jaumeortola commented 2 years ago

Are you sure it doesn't work? It is used in several languages (Catalan, French, Spanish...). It should be <unify>. <unification> is used in the definition of features. See a French example:

<antipattern>
    <unify>
        <feature id="number"/>
        <feature id="gender"/>
        <token postag="(P\+)?D .*" postag_regexp="yes"/>
        <token postag="[NJ] .*|V.* ppa" postag_regexp="yes"/>
    </unify>
</antipattern>
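
For context, <unify> only refers to features by id; the features themselves are declared separately in <unification> blocks. A minimal sketch of such a declaration, assuming the feature/equivalence/type layout used in LanguageTool grammar files. The POS tag regexes are placeholders, not the real French tag set:

```xml
<!-- Sketch only: declares the "number" feature that <feature id="number"/> refers to. -->
<!-- The postag regexes below are hypothetical placeholders. -->
<unification feature="number">
    <!-- tokens whose tag marks them as singular -->
    <equivalence type="sg">
        <token postag=".* s" postag_regexp="yes"/>
    </equivalence>
    <!-- tokens whose tag marks them as plural -->
    <equivalence type="pl">
        <token postag=".* p" postag_regexp="yes"/>
    </equivalence>
</unification>
```

With a declaration like this in place, <unify> matches only when all its tokens fall into the same equivalence type for each listed feature.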
marcoagpinto commented 2 years ago

@jaumeortola

  <rulegroup id="SPACE_BEFORE_PUNCTUATION" name="Espaços antes da pontuação">
    <!-- Based on German grammar.xml, by Tiago F. Santos, 2017-07-08 -->

<!-- MARCOAGPINTO 2022-01-21 (1-JAN-2022+) *START* -->
<!--

HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
 AFTER:xxxx
-->
      <antipattern>
       <unify>
        <token regexp='yes'>extensão|extensões|ficheiros?</token>
        <token spacebefore='yes' regexp='yes'>[.]</token>
        <token spacebefore='no' postag='NP.+|AQ.+|NC.+' postag_regexp='yes'/>
       </unify>
      </antipattern>
<!-- MARCOAGPINTO 2022-01-21 (1-JAN-2022+) *END* -->

    <rule>
      <regexp>\b([\p{L}\d]+) ([!?»”’,….])</regexp>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
    <!--example correction="escapou!">Como é que isto me <marker>escapou   !</marker></example-->
      <example correction="roda.">Existem duas estratégias possíveis: aproveitar o que existe ou reinventar a <marker>roda .</marker></example>
    </rule>
    <rule>
      <regexp>\b([\p{L}\d]+) ([:;])(?![\-o]?(?:[()/]|[DSP]\b))</regexp>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="possíveis:">Existem duas estratégias <marker>possíveis :</marker> aproveitar o que existe ou reinventar a roda.</example>
      <example>Um sorriso :-)</example>
      <example>Um sorriso :)</example>
      <example>Um sorriso :(</example>
      <example>Um sorriso :-/</example>
      <example>Um sorriso :/</example>
      <example>Um sorriso :D</example>
      <example correction="Brasil;">Site de Instituto Ludwig von Mises <marker>Brasil ;</marker>Principais portais web</example>
    </rule>
  </rulegroup>

    <rule id="SEMICOLON_AND_QUOTES" name="Ponto e vírgula antes de aspas">
    <!-- Localized from German grammar.xml by Tiago F. Santos,  2017-08-17      -->
      <antipattern>
          <token regexp="yes">„|“|»|«|"</token>
          <token>;</token>
          <token regexp="yes">»|«|"|”|‘|‹|›|'>
          </token>
      </antipattern>
      <pattern>
        <marker>
          <token>;</token>
          <token spacebefore="no" regexp="yes">“|»|«|"|”|‘|‹|›|'</token>
        </marker>
          <token spacebefore="yes"/>
      </pattern>
      <message>Geralmente, não se coloca ponto e vírgula antes de aspas.</message>
      <example type="incorrect">“Não me incomode com sua tagarelice infantil<marker>;”</marker> disse Maria; “eu escrevo versos imortais.”</example>
    </rule>

TESTRULES PT throws countless errors:

Running XML validation for pt/grammar.xml...
cvc-complex-type.2.4.a: Invalid content was found starting with element 'token'. One of '{feature}' is expected. Problem found at line 36457, column 23.
Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/pt/grammar.xml'
        at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:109)
        at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:207)
        at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:183)
        at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:158)
        at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:153)
        at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:737)
Caused by: org.xml.sax.SAXParseException; lineNumber: 36457; columnNumber: 23; cvc-complex-type.2.4.a: Invalid content was found starting with element 'token'. One of '{feature}' is expected.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.startElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.validation.StreamValidatorHelper.validate(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(Unknown Source)
        at javax.xml.validation.Validator.validate(Unknown Source)
        at org.languagetool.XMLValidator.validateInternal(XMLValidator.java:203)
        at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:107)
        ... 5 more
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...
Exception in thread "main" java.lang.RuntimeException: Could not activate rules
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:334)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:293)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:353)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:259)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.testDisambiguationRulesFromXML(DisambiguationRuleTest.java:70)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.main(DisambiguationRuleTest.java:238)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/pt/grammar.xml'
        at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:80)
        at org.languagetool.Language.getPatternRules(Language.java:641)
        at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:662)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:327)
        ... 5 more
Caused by: java.lang.RuntimeException: <regexp> rules currently cannot be used together with <antipattern>. Rule id: SPACE_BEFORE_PUNCTUATION[1]
        at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:648)
        at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:408)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)

yakovru commented 2 years ago

@marcoagpinto The unify tag needs to be accompanied by a line with a feature tag.

<antipattern>
        <feature id="number"/>
       <unify>
        <token regexp='yes'>extensão|extensões|ficheiros?</token>
                <unify-ignore>
             <token spacebefore='yes' regexp='yes'>[.]</token>
                </unify-ignore>
        <token spacebefore='no' postag='NP.+|AQ.+|NC.+' postag_regexp='yes'/>
       </unify>
</antipattern>

marcoagpinto commented 2 years ago

@yakovru

Thank you, I will test it at 5am.

marcoagpinto commented 2 years ago

What does <unify-ignore> do?

marcoagpinto commented 2 years ago

@yakovru

<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--

HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
 AFTER:xxxx
-->
      <antipattern>
        <feature id="number"/>
            <unify>
                <token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
                <unify-ignore>
                    <token spacebefore='yes' regexp='yes'>[.]</token>
                </unify-ignore>
                    <token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
            </unify>
      </antipattern>
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *END* -->

It throws tons of errors with TESTRULES PT:

Running XML pattern tests...
LanguageTool version 5.7-SNAPSHOT (2022-01-10 19:41:10 +0000, 46c2d6c)
Known languages: [Arabic, English, English (US), English (GB), English (Australian), English (Canadian), English (New Zealand), English (South African), Persian, French, German, German (Germany), German (Austria), German (Swiss), Simple German, Polish, Catalan, Catalan (Valencian), Italian, Breton, Dutch, Dutch (Belgium), Portuguese, Portuguese (Portugal), Portuguese (Brazil), Portuguese (Angola preAO), Portuguese (Moçambique preAO), Russian, Asturian, Belarusian, Chinese, Danish, Esperanto, Irish, Galician, Greek, Japanese, Khmer, Romanian, Slovak, Slovenian, Spanish, Spanish (voseo), Swedish, Tamil, Tagalog, Ukrainian, Testlanguage]
Running XML validation for pt/grammar.xml...
cvc-complex-type.2.4.a: Invalid content was found starting with element 'feature'. One of '{token, and, unify, marker}' is expected. Problem found at line 36441, column 31.
Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/pt/grammar.xml'
        at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:109)
        at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:207)
        at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:183)
        at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:158)
        at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:153)
        at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:737)
Caused by: org.xml.sax.SAXParseException; lineNumber: 36441; columnNumber: 31; cvc-complex-type.2.4.a: Invalid content was found starting with element 'feature'. One of '{token, and, unify, marker}' is expected.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.emptyElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.emptyElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.validation.StreamValidatorHelper.validate(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(Unknown Source)
        at javax.xml.validation.Validator.validate(Unknown Source)
        at org.languagetool.XMLValidator.validateInternal(XMLValidator.java:203)
        at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:107)
        ... 5 more
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...
Exception in thread "main" java.lang.RuntimeException: Could not activate rules
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:334)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:293)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:353)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:259)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.testDisambiguationRulesFromXML(DisambiguationRuleTest.java:70)
        at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.main(DisambiguationRuleTest.java:238)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/pt/grammar.xml'
        at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:80)
        at org.languagetool.Language.getPatternRules(Language.java:641)
        at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:662)
        at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:327)
        ... 5 more
Caused by: java.lang.RuntimeException: <regexp> rules currently cannot be used together with <antipattern>. Rule id: SPACE_BEFORE_PUNCTUATION[1]
        at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:648)
        at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:408)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:77)
        ... 8 more
Running XML bitext pattern tests...
Bitext pattern tests successful.
Validating false-friends.xml...
Validation successfully finished.

milekpl commented 2 years ago

@yakovru


<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--

HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
 AFTER:xxxx
-->
    <antipattern>
        <feature id="number"/>
          <unify>

The feature tag should be inside unify.
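
Applied to the antipattern above, that placement would look roughly like the sketch below. It assumes that a "number" feature is already declared for Portuguese in a <unification> block, which this thread does not confirm. It addresses only the schema validation error; the separate "<regexp> rules currently cannot be used together with <antipattern>" error would remain.

```xml
<antipattern>
    <unify>
        <!-- The feature(s) to unify come first, directly inside <unify>. -->
        <!-- Assumption: a "number" feature is declared elsewhere in the grammar file. -->
        <feature id="number"/>
        <token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
        <!-- Tokens wrapped in <unify-ignore> must still match, but are left out of unification. -->
        <unify-ignore>
            <token spacebefore='yes' regexp='yes'>[.]</token>
        </unify-ignore>
        <token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
    </unify>
</antipattern>
```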

marcoagpinto commented 2 years ago

Hello @yakovru @jaumeortola @udomai

Could one of you directly insert a version of my antipattern in the grammar.xml so that it ships with the official release at the end of March?

It fixes tons of false positives.

I have given up trying to do it myself, since all my attempts produce errors in TESTRULES PT.

Thanks!

<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--

HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
 AFTER:xxxx
-->
      <antipattern>
       <unify>
        <token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
        <token spacebefore='yes' regexp='yes'>[.]</token>
        <token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
       </unify>
      </antipattern>
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *END* -->

jaumeortola commented 2 years ago

@marcoagpinto I don't understand what you are trying to do. Please tell me which rule you are improving, and add a couple of example sentences.

marcoagpinto commented 2 years ago

@jaumeortola

"O ficheiro .PNG é grande." "Abre o arquivo .JPG."

It suggests removing the space before the period.

My idea is to create an antipattern for it, but the rule uses regexp, so it doesn't accept normal antipatterns.
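
One possible way around this restriction, sketched here as an assumption and not tested against TESTRULES PT, is to rewrite the first <regexp> rule as a token-based <pattern> rule so that it can carry its own antipattern. The sketch assumes the tokenizer splits ".PNG" into "." and "PNG", and it uses plain tokens without unification, since the false positive does not depend on agreement:

```xml
<rule>
    <!-- Hypothetical antipattern: "ficheiro .PNG", "arquivo .JPG", and similar. -->
    <antipattern>
        <token regexp="yes">arquivos?|extensão|extensões|ficheiros?</token>
        <token spacebefore="yes">.</token>
        <token spacebefore="no" regexp="yes">[\p{L}\d]+</token>
    </antipattern>
    <!-- Token-based equivalent of the original regexp: a word, then spaced punctuation. -->
    <pattern>
        <token regexp="yes">[\p{L}\d]+</token>
        <token spacebefore="yes" regexp="yes">[!?»”’,….]</token>
    </pattern>
    <message>Remova o espaço antes deste sinal de pontuação.</message>
    <suggestion>\1\2</suggestion>
    <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
    <example>O ficheiro .PNG é grande.</example>
</rule>
```

The second rule in the group, which relies on a negative lookahead, would need its own conversion and is not covered by this sketch.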

milekpl commented 2 years ago

@marcoagpinto to use unify, you must explicitly say which feature is to be unified.