Open yakovru opened 2 years ago
Are you sure it doesn't work? It is used in several languages (Catalan, French, Spanish...). It should be <unify>. <unification> is used in the definition of features. See a French example:
<antipattern>
<unify>
<feature id="number"/>
<feature id="gender"/>
<token postag="(P\+)?D .*" postag_regexp="yes"/>
<token postag="[NJ] .*|V.* ppa" postag_regexp="yes"/>
</unify>
</antipattern>
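For context, here is a minimal sketch of what a feature definition with <unification> can look like, as opposed to its use with <unify>. The postag expressions are illustrative placeholders, not copied from any language's grammar.xml; the real definitions depend on the tagset of each language.

```xml
<!-- Sketch only: declares a "number" feature that <unify> can refer to
     through <feature id="number"/>. -->
<unification feature="number">
  <equivalence type="sg">
    <!-- assumption: singular tags end in " s" -->
    <token postag=".* s" postag_regexp="yes"/>
  </equivalence>
  <equivalence type="pl">
    <!-- assumption: plural tags end in " p" -->
    <token postag=".* p" postag_regexp="yes"/>
  </equivalence>
</unification>
```

Tokens wrapped in <unify> then match only if they agree on every feature listed there.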
@jaumeortola
<rulegroup id="SPACE_BEFORE_PUNCTUATION" name="Espaços antes da pontuação">
<!-- Based on German grammar.xml, by Tiago F. Santos, 2017-07-08 -->
<!-- MARCOAGPINTO 2022-01-21 (1-JAN-2022+) *START* -->
<!--
HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
AFTER:xxxx
-->
<antipattern>
<unify>
<token regexp='yes'>extensão|extensões|ficheiros?</token>
<token spacebefore='yes' regexp='yes'>[.]</token>
<token spacebefore='no' postag='NP.+|AQ.+|NC.+' postag_regexp='yes'/>
</unify>
</antipattern>
<!-- MARCOAGPINTO 2022-01-21 (1-JAN-2022+) *END* -->
<rule>
<regexp>\b([\p{L}\d]+) ([!?»”’,….])</regexp>
<message>Remova o espaço antes deste sinal de pontuação.</message>
<suggestion>\1\2</suggestion>
<example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
<!--example correction="escapou!">Como é que isto me <marker>escapou !</marker></example-->
<example correction="roda.">Existem duas estratégias possíveis: aproveitar o que existe ou reinventar a <marker>roda .</marker></example>
</rule>
<rule>
<regexp>\b([\p{L}\d]+) ([:;])(?![\-o]?(?:[()/]|[DSP]\b))</regexp>
<message>Remova o espaço antes deste sinal de pontuação.</message>
<suggestion>\1\2</suggestion>
<example correction="possíveis:">Existem duas estratégias <marker>possíveis :</marker> aproveitar o que existe ou reinventar a roda.</example>
<example>Um sorriso :-)</example>
<example>Um sorriso :)</example>
<example>Um sorriso :(</example>
<example>Um sorriso :-/</example>
<example>Um sorriso :/</example>
<example>Um sorriso :D</example>
<example correction="Brasil;">Site de Instituto Ludwig von Mises <marker>Brasil ;</marker>Principais portais web</example>
</rule>
</rulegroup>
<rule id="SEMICOLON_AND_QUOTES" name="Ponto e vírgula antes de aspas">
<!-- Localized from German grammar.xml by Tiago F. Santos, 2017-08-17 -->
<antipattern>
<token regexp="yes">„|“|»|«|"</token>
<token>;</token>
<token regexp="yes">»|«|"|”|‘|‹|›|'>
</token>
</antipattern>
<pattern>
<marker>
<token>;</token>
<token spacebefore="no" regexp="yes">“|»|«|"|”|‘|‹|›|'</token>
</marker>
<token spacebefore="yes"/>
</pattern>
<message>Geralmente, não se coloca ponto e vírgula antes de aspas.</message>
<example type="incorrect">“Não me incomode com sua tagarelice infantil<marker>;”</marker> disse Maria; “eu escrevo versos imortais.”</example>
</rule>
TESTRULES PT throws countless errors:
Running XML validation for pt/grammar.xml...
cvc-complex-type.2.4.a: Invalid content was found starting with element 'token'. One of '{feature}' is expected. Problem found at line 36457, column 23.
Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/pt/grammar.xml'
at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:109)
at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:207)
at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:183)
at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:158)
at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:153)
at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:737)
Caused by: org.xml.sax.SAXParseException; lineNumber: 36457; columnNumber: 23; cvc-complex-type.2.4.a: Invalid content was found starting with element 'token'. One of '{feature}' is expected.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.startElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.validation.StreamValidatorHelper.validate(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(Unknown Source)
at javax.xml.validation.Validator.validate(Unknown Source)
at org.languagetool.XMLValidator.validateInternal(XMLValidator.java:203)
at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:107)
... 5 more
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...
Exception in thread "main" java.lang.RuntimeException: Could not activate rules
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:334)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:293)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:353)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:259)
at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.testDisambiguationRulesFromXML(DisambiguationRuleTest.java:70)
at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.main(DisambiguationRuleTest.java:238)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/pt/grammar.xml'
at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:80)
at org.languagetool.Language.getPatternRules(Language.java:641)
at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:662)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:327)
... 5 more
Caused by: java.lang.RuntimeException: <regexp> rules currently cannot be used together with <antipattern>. Rule id: SPACE_BEFORE_PUNCTUATION[1]
at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:648)
at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:408)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
@marcoagpinto
After the <unify> tag, there should be a line with the <feature> tag.
<antipattern>
<feature id="number"/>
<unify>
<token regexp='yes'>extensão|extensões|ficheiros?</token>
<unify-ignore>
<token spacebefore='yes' regexp='yes'>[.]</token>
</unify-ignore>
<token spacebefore='no' postag='NP.+|AQ.+|NC.+' postag_regexp='yes'/>
</unify>
</antipattern>
@yakovru
Thank you, I will test it at 5am.
What does <unify-ignore> do?
@yakovru
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--
HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
AFTER:xxxx
-->
<antipattern>
<feature id="number"/>
<unify>
<token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
<unify-ignore>
<token spacebefore='yes' regexp='yes'>[.]</token>
</unify-ignore>
<token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
</unify>
</antipattern>
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *END* -->
It throws tons of errors with TESTRULES PT:
Running XML pattern tests...
LanguageTool version 5.7-SNAPSHOT (2022-01-10 19:41:10 +0000, 46c2d6c)
Known languages: [Arabic, English, English (US), English (GB), English (Australian), English (Canadian), English (New Zealand), English (South African), Persian, French, German, German (Germany), German (Austria), German (Swiss), Simple German, Polish, Catalan, Catalan (Valencian), Italian, Breton, Dutch, Dutch (Belgium), Portuguese, Portuguese (Portugal), Portuguese (Brazil), Portuguese (Angola preAO), Portuguese (Moçambique preAO), Russian, Asturian, Belarusian, Chinese, Danish, Esperanto, Irish, Galician, Greek, Japanese, Khmer, Romanian, Slovak, Slovenian, Spanish, Spanish (voseo), Swedish, Tamil, Tagalog, Ukrainian, Testlanguage]
Running XML validation for pt/grammar.xml...
cvc-complex-type.2.4.a: Invalid content was found starting with element 'feature'. One of '{token, and, unify, marker}' is expected. Problem found at line 36441, column 31.
Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/pt/grammar.xml'
at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:109)
at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:207)
at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:183)
at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:158)
at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:153)
at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:737)
Caused by: org.xml.sax.SAXParseException; lineNumber: 36441; columnNumber: 31; cvc-complex-type.2.4.a: Invalid content was found starting with element 'feature'. One of '{token, and, unify, marker}' is expected.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.error(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator$XSIErrorReporter.reportError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.reportSchemaError(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.emptyElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.emptyElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.validation.StreamValidatorHelper.validate(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(Unknown Source)
at javax.xml.validation.Validator.validate(Unknown Source)
at org.languagetool.XMLValidator.validateInternal(XMLValidator.java:203)
at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:107)
... 5 more
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...
Exception in thread "main" java.lang.RuntimeException: Could not activate rules
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:334)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:293)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:353)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:259)
at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.testDisambiguationRulesFromXML(DisambiguationRuleTest.java:70)
at org.languagetool.tagging.disambiguation.rules.DisambiguationRuleTest.main(DisambiguationRuleTest.java:238)
Caused by: java.io.IOException: Cannot load or parse input stream of '/org/languagetool/rules/pt/grammar.xml'
at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:80)
at org.languagetool.Language.getPatternRules(Language.java:641)
at org.languagetool.JLanguageTool.activateDefaultPatternRules(JLanguageTool.java:662)
at org.languagetool.JLanguageTool.<init>(JLanguageTool.java:327)
... 5 more
Caused by: java.lang.RuntimeException: <regexp> rules currently cannot be used together with <antipattern>. Rule id: SPACE_BEFORE_PUNCTUATION[1]
at org.languagetool.rules.patterns.PatternRuleHandler.createRules(PatternRuleHandler.java:648)
at org.languagetool.rules.patterns.PatternRuleHandler.endElement(PatternRuleHandler.java:408)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endNamespaceScope(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.handleEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDValidator.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at org.languagetool.rules.patterns.PatternRuleLoader.getRules(PatternRuleLoader.java:77)
... 8 more
Running XML bitext pattern tests...
Bitext pattern tests successful.
Validating false-friends.xml...
Validation successfully finished.
@yakovru
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--
HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
AFTER:xxxx
-->
<antipattern>
<feature id="number"/>
<unify>
The <feature> tag should be inside <unify>.
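A sketch of that placement, reusing the tokens from the antipattern quoted above. The <feature> element becomes the first child of <unify>, which is what the schema message "One of '{feature}' is expected" asks for. This assumes a "number" feature is declared in the Portuguese <unification> definitions, and it has not been run through TESTRULES PT. It addresses only the schema error, not the separate runtime error about <regexp> rules and <antipattern>.

```xml
<antipattern>
  <unify>
    <!-- assumption: "number" is declared in a <unification> block for pt -->
    <feature id="number"/>
    <token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
    <unify-ignore>
      <!-- matched as part of the sequence, but left out of the agreement check -->
      <token spacebefore='yes' regexp='yes'>[.]</token>
    </unify-ignore>
    <token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
  </unify>
</antipattern>
```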
Hello @yakovru @jaumeortola @udomai
Could one of you directly insert a version of my antipattern in the grammar.xml so that it ships with the official release at the end of March?
It fixes tons of false positives.
I have given up trying to do it myself, since all my attempts produce errors in TESTRULES PT.
Thanks!
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *START* -->
<!--
HITS AGAINST A 600 000 CORPORA:
BEFORE:xxxx
AFTER:xxxx
-->
<antipattern>
<unify>
<token regexp='yes'>arquivos?|extensão|extensões|ficheiros?</token>
<token spacebefore='yes' regexp='yes'>[.]</token>
<token spacebefore='no' postag='NP.+|AQ.+|NC.+|UNKNOWN' postag_regexp='yes'/>
</unify>
</antipattern>
<!-- MARCOAGPINTO 2022-01-22 (1-JAN-2022+) *END* -->
@marcoagpinto I don't understand what you are trying to do. Please tell us which rule you are improving, and add a couple of example sentences.
@jaumeortola
"O ficheiro .PNG é grande." "Abre o arquivo .JPG."
It suggests removing the space before the period.
My idea is to create an antipattern for it, but the rule uses regexp, so it doesn't accept normal antipatterns.
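One possible way around that limitation, sketched here and not tested, is to rewrite the first <regexp> rule of SPACE_BEFORE_PUNCTUATION as a token-based <pattern>, which does accept an <antipattern>. No <unify> is needed for this case, since nothing has to agree in number or gender. The sketch assumes the tokenizer splits ".PNG" into "." and "PNG", and that two tokens reproduce the behaviour of the original regular expression closely enough.

```xml
<rule>
  <!-- exception: file-type mentions such as "ficheiro .PNG" -->
  <antipattern>
    <token regexp="yes">arquivos?|extensão|extensões|ficheiros?</token>
    <token spacebefore="yes">.</token>
    <token spacebefore="no"/>
  </antipattern>
  <pattern>
    <token regexp="yes">[\p{L}\d]+</token>
    <token spacebefore="yes" regexp="yes">[!?»”’,….]</token>
  </pattern>
  <message>Remova o espaço antes deste sinal de pontuação.</message>
  <!-- in a token pattern, \1 and \2 refer to the first and second token -->
  <suggestion>\1\2</suggestion>
  <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
  <example>O ficheiro .PNG é grande.</example>
  <example>Abre o arquivo .JPG.</example>
</rule>
```

The two examples without a correction serve as regression tests for the false positives reported here.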
@marcoagpinto to use <unify>, you must explicitly say which <feature> is to be unified.
Please allow the use of unification in antipatterns. Currently we can put <unification></unification> tags inside <antipattern></antipattern>, but it does not work, and the rules test shows no errors.