languagetool-org / languagetool

[pt] Tutorial for developing PT rules #6882

Open marcoagpinto opened 2 years ago

marcoagpinto commented 2 years ago

Hello @susanaboatto

I will update this several times to add and improve information. It will probably take weeks, as I progress slowly.

2. Use of normal words and regexp

Susana wrote:

I want to add antipatterns to this rule to fix the false positives in the case of numbers with percentages and where "fora" is not a verb. What do you think of the antipatterns below?

  <antipattern case_sensitive='yes'>
    <token>Fora</token>
    <token postag='AQ..P.+|NC.P.+' postag_regexp='yes'/>
    <token postag='AQ..P.+|NC.P.+' postag_regexp='yes'/>
    <example>Fora casos de ausência por questões de saúde, não é permitido faltar.</example>
  </antipattern>
  <antipattern>
    <token postag=',|-' postag_regexp='yes'/>
    <token>fora</token>
    <token postag='AQ..P.+|NC.P.+' postag_regexp='yes'/>
    <example>Não é permitido faltar, fora casos de ausência por questões de saúde.</example>
  </antipattern>
  <antipattern>
    <token postag='V.+' postag_regexp='yes'/>
    <token postag='NCMP000' postag_regexp='no'/>
    <example>Estou 100% com você.</example>
  </antipattern>

Reply from Marco:

Always check the POS information at:
https://community.languagetool.org/analysis/analyzeText

Paste there:

Fora casos de ausência por questões de saúde, não é permitido faltar.
Não é permitido faltar, fora casos de ausência por questões de saúde.

You can see that "fora" can be tagged as a noun and also as a form of the verbs "ir" and "ser":

![fora_postags_20220706](https://user-images.githubusercontent.com/5192600/177743892-0527ce1d-7ac0-4cf9-a1b0-fa3ee768fed0.png)

Avoid using literal words as much as possible, since they limit what the rule can match.
This should cover your first two antipatterns:

  <antipattern>
    <token postag='NCCS000' postag_regexp='no'/>
    <token postag='AQ..P.+|NC.P.+' postag_regexp='yes'/>
    <token postag='SPS00' postag_regexp='no'/>
    <example>Fora casos de ausência por questões de saúde, não é permitido faltar.</example>
    <example>Não é permitido faltar, fora casos de ausência por questões de saúde.</example>
  </antipattern>

Then run some tests, replacing:
<token postag='NCCS000' postag_regexp='no'/>
with:
<token postag='NC.S.+' postag_regexp='yes'/>

Use the corpus provided by Daniel Naber to check the results (unzip the Wikipedia tool into a folder):
java -Dfile.encoding=UTF-8 -Xmx4500M -jar languagetool-wikipedia.jar check-data -l pt-PT -r GENERAL_NUMBER_AGREEMENT_ERRORS -f pt-PT.txt -f tatoeba-pt.txt --max-sentences 600000 --context-size 100 >0.txt

You can change the rule ID and the number of sentences here. I have it at 600 000 because I am slightly mad.

Run the first check before creating the antipattern, and use 0.txt as the output.

As you make changes and test them, change the output to 1.txt, 2.txt, 3.txt, etc.
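
To keep the numbered runs consistent, the command above can be generated instead of edited by hand. This is only a sketch: `build_command` is a hypothetical helper name, and every flag and file name is copied unchanged from the command above. Only the rule ID, the sentence limit, and the output number vary.

```python
def build_command(rule_id: str, run_number: int, max_sentences: int = 600000) -> str:
    """Build the check-data command line, redirecting output to <run_number>.txt.

    Hypothetical helper: all flags and file names come from the command
    quoted in this post; nothing else about the tool is assumed.
    """
    parts = [
        "java", "-Dfile.encoding=UTF-8", "-Xmx4500M",
        "-jar", "languagetool-wikipedia.jar", "check-data",
        "-l", "pt-PT",
        "-r", rule_id,
        "-f", "pt-PT.txt",
        "-f", "tatoeba-pt.txt",
        "--max-sentences", str(max_sentences),
        "--context-size", "100",
    ]
    return " ".join(parts) + f" >{run_number}.txt"


# Run 0 is the baseline, taken before any antipattern is added.
print(build_command("GENERAL_NUMBER_AGREEMENT_ERRORS", 0))
# A later, smaller run after the first change.
print(build_command("GENERAL_NUMBER_AGREEMENT_ERRORS", 1, max_sentences=100000))
```

The printed lines can be pasted into the shell opened in the Wikipedia tool's folder.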

Then run a diff (on Windows, you can use TortoiseSVN) to see the false positives and the successes.

If you have several .txt files, rename the most correct one to something like "2stable.txt", so that future outputs can be checked (diffed) against 2stable.txt instead of 0.txt.

Before committing the change, check only the last output against 0.txt, to make sure nothing escapes.
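
If no diff tool is at hand, the comparison of two outputs can be sketched in a few lines of Python with the standard `difflib` module. The function name `compare_runs` and the sample lines are made up for illustration. The real 0.txt and 1.txt would be read from disk, and their exact layout is not assumed here.

```python
import difflib


def compare_runs(before: list[str], after: list[str]) -> tuple[list[str], list[str]]:
    """Return (lines that disappeared, lines that are new) between two runs."""
    gone, new = [], []
    for line in difflib.ndiff(before, after):
        if line.startswith("- "):      # present only in the earlier run
            gone.append(line[2:])
        elif line.startswith("+ "):    # present only in the later run
            new.append(line[2:])
    return gone, new


# Made-up stand-ins for the contents of 0.txt and 1.txt.
run0 = ["hit: Fora casos de ausência", "hit: os casa amarela"]
run1 = ["hit: os casa amarela", "hit: as carro novo"]

gone, new = compare_runs(run0, run1)
print("no longer reported:", gone)
print("newly reported:", new)
```

Lines that are no longer reported are the candidates for fixed false positives. Newly reported lines deserve a closer look before committing.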

Also, unzip the standalone tool to the desktop to do a
testrules pt
or
testrules pt-pt
or
testrules pt-br
depending on what you are doing.

Ahhhh... for both the Wikipedia and the standalone tools, you need to run the commands from the shell (DOS).

You can simply open the folders in normal windows, copy the path, open the shell by typing CMD in the magnifying glass on Windows 10, and type there:
cd path_here
You should have two shells open, one for the Wikipedia tool and the other for the standalone tool.

I advise you to use Notepad++ to edit the files; it is extremely powerful. Just remember to go to the menu that lets you select the language of the document and change it to XML, so that all tags appear in colour.

I keep three grammar.xml files open in Notepad++:

    from the repository (checkout)
    standalone tool
    wikipedia tool

Always make the changes first in the standalone tool's file and run TESTRULES PT. Then select and copy the whole text (Ctrl+A, Ctrl+C), go to the Wikipedia tool's tab, and replace its contents (Ctrl+A, Ctrl+V).

Then test the results.

If all is okay you can copy the file from the standalone tool and replace the repository one with it and commit.

ahhhhh... tell me if this helps. I really need to write some good documentation for all this with examples.

You can find each grammar.xml, spelling.txt, added.txt by searching for them (at least in Windows) inside the folders, but you should open the folder and then a subfolder because otherwise Windows won't show hits.

barbarisms-pt.txt (grammar.xml folder)
pre-reform-compounds.txt (added.txt folder)
README_pt_PT.txt  (spelling.txt folder)

See? Search for README_pt_PT.txt and you will find the folder that contains spelling.txt, and so on.

Ahhhhhhhh…

I remembered one thing I always do.

Before starting to edit or create new rules, I first update the checkout files.

Only after that do I open Notepad++, to make sure I have the latest grammar.xml.

Then I copy and paste that grammar.xml into the standalone tool grammar.xml.

When you have the chance, check the results at:
https://internal1.languagetool.org/regression-tests/via-http
(for PT they should appear during the night, 4am? 5am? I am not sure).

<token postag='NC.S.+' postag_regexp='no'/>

TESTRULES PT gives a warning here, because you should have used: postag_regexp='yes'

"regexp" means "Reg(ular) Exp(ression)". You are using one whenever you combine more than one POS tag (separated with "|") or use special characters in a POS tag, such as "." (which means that any character can appear at that position), "[" and "]", "?", etc.

If you use regular expressions and forget to set postag_regexp='yes' (or regexp='yes' when matching words), the rule will not work properly (it will give wrong or incomplete results).
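
The difference between the two settings can be sketched in Python. This is an illustration, not LanguageTool's code: `tag_matches` is a hypothetical function, and it assumes that without the regexp flag the tag is compared as a plain string, while with it the expression must match the whole tag.

```python
import re


def tag_matches(pattern: str, postag: str, postag_regexp: bool) -> bool:
    """Hypothetical model of how a token's postag attribute is compared."""
    if postag_regexp:
        # Assumption: the expression has to cover the entire tag.
        return re.fullmatch(pattern, postag) is not None
    # Without the flag, "." and "+" are ordinary characters.
    return pattern == postag


# "NC.S.+" read literally never equals a real tag such as NCMS000.
print(tag_matches("NC.S.+", "NCMS000", postag_regexp=False))
# Read as a regular expression, it matches any singular common noun tag.
print(tag_matches("NC.S.+", "NCMS000", postag_regexp=True))
# A complete literal tag needs no regular expression.
print(tag_matches("NCCS000", "NCCS000", postag_regexp=False))
```

Under these assumptions, the first call returns False, which is the silent failure the warning is meant to catch.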

3. Rules that create too many FPs

Susana wrote: Hi @marcoagpinto, this rule is throwing many false alarms (see pictures). I want to remove it from the PT grammar - unless you have a better suggestion. I would say that, at least for the PT-BR grammar, it is not relevant because it doesn't really correct any mistakes (and there aren't any to be found).

Marco replied:

Please use the:
default="off">

I need to fix tons of antipatterns.

If a rule throws too many false positives, use the default off.

I will slowly fix the rules.

4. Gender and number agreements for POSes

For number agreements independent of gender, you can use:

AQ..P.+|NC.P.+
AQ..S.+|NC.S.+

For gender agreements, use:

AQ.[CM].+|NC[CM].+
AQ.[CF].+|NC[CF].+

Sometimes you cannot include the "C" (words that are both masculine and feminine), or the rule will create false positives; you must test the rules against the corpora which @danielnaber provides.

For any non-gender and non-number agreements, you can simply use:

NC.+
AQ.+
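
Before running the corpus, the agreement expressions above can be tried against a few tags in Python. This is a minimal sketch: the tags are taken from the POS list in topic 9, `classify` is a hypothetical helper, and whole-tag matching is an assumption about how the expressions are applied.

```python
import re

# The four expressions from this topic, unchanged.
PATTERNS = {
    "plural": "AQ..P.+|NC.P.+",
    "singular": "AQ..S.+|NC.S.+",
    "masculine": "AQ.[CM].+|NC[CM].+",
    "feminine": "AQ.[CF].+|NC[CF].+",
}


def classify(postag: str) -> set[str]:
    """Return the names of all expressions that match the whole tag."""
    return {name for name, rx in PATTERNS.items() if re.fullmatch(rx, postag)}


for tag in ["NCMP000", "NCFS000", "AQ0CS0", "AQ0FP0", "AQ0CN0"]:
    print(tag, sorted(classify(tag)))
```

Note that a "C" tag such as AQ0CS0 satisfies both gender expressions, and that a tag with "N" in the number position, such as AQ0CN0, satisfies neither number expression.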

5. Some possible sentences for suggestions:

      <message>Em certos contextos, esta perífrase pode ser simplificada.</message>
      <message>Enriqueça a linguagem para causar mais impacto ao leitor.</message>
      <message>Esta perífrase pode ser simplificada.</message>
      <message>Esta perífrase poderá ser simplificada.</message>
      <message>Expressão vulgar, pondere empregar:</message>
      <message>Possível confusão de termos.</message>
      <message>Se for um texto académico, pondere melhorar a linguagem.</message>
      <message>Se for um texto académico/científico, pondere melhorar a linguagem.</message>
      <message>Se for um texto académico/científico, pondere empregar o termo 'imprecisão'.</message>
      <message>Se for um texto académico/científico, pondere empregar o termo 'exato'.</message>
      <message>Se for uma tese de doutoramento, verifique se o 'tom' de redação é o apropriado.</message>
      <message>Se estiver a referir-se a fármacos ou afins, empregue o termo 'embalagem'.</message>
      <message>Se estiver a referir-se a fármacos ou afins, empregue o termo 'tomar'.</message>

6. Suggestions handling

I keep these snippets in a text file, and copy and adapt them while developing rules.

I need to add clearer examples when I have the time.

      <filter class="org.languagetool.rules.pt.AdvancedSynthesizerFilter" args="lemmaFrom:3 lemmaSelect:V.* postagFrom:1 postagSelect:V.*"/>

      For nouns with 'C' (both male and female), as in rule id='INIMIGO_ADVERSÁRIO_ALIADO_OPONENTE':
      <filter class="org.languagetool.rules.pt.AdvancedSynthesizerFilter" args="lemmaFrom:4 lemmaSelect:NC.* postagFrom:2 postagSelect:NC(.)(.).* postagReplace:NC[\b1C][\b2N].*" />

<match no='2' postag='VMN0000' postag_regexp="yes" postag_replace='VMII1S0|VMII3S0'/>

<match no='1' postag='V.+' postag_regexp='yes'>arrendar</match>

<suggestion><match no='1' postag='(V..).(.+)' postag_replace='$1P$2'>estar</match> mal disposto</suggestion>

<suggestion>prognóstico<match no='1' regexp_match='(diagnóstico)(s?)' regexp_replace='$2'/> \2</suggestion>

<suggestion>desde <match no='2' regexp_match='(d)(.)(s?)' regexp_replace='$2$3'/></suggestion>

Jaume wrote:
SENT_START_NUM is the rule ID. 
Here, I added the option of adding spelled numbers as suggestions: `<suggestion><match no="3" postag="_spell_number_" case_conversion="firstupper"/></suggestion>`
12 -> Doze, 1000 -> Mil, and so on.
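
The regexp_match / regexp_replace pairs in the snippets above can be tried out in Python before putting them into a rule. This is a sketch under two assumptions: that the replacement is applied to the matched token's text, and that `$1`-style references behave like ordinary group references. `apply_match` is a hypothetical helper, not part of LanguageTool.

```python
import re


def apply_match(regexp_match: str, regexp_replace: str, token: str) -> str:
    """Apply a $1-style replacement to a token, as a stand-in for the <match> element."""
    # Translate Java-style "$2" into Python-style "\g<2>".
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", regexp_replace)
    return re.sub(regexp_match, python_replacement, token)


# Keep only the optional plural "s" of "diagnóstico(s)" ...
plural_s = apply_match("(diagnóstico)(s?)", "$2", "diagnósticos")
# ... so that it can be appended to the fixed word in the suggestion.
print("prognóstico" + plural_s)

# Strip the leading "d" from a contraction.
print(apply_match("(d)(.)(s?)", "$2$3", "das"))
print(apply_match("(d)(.)(s?)", "$2$3", "do"))
```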

7. Exceptions and regular expressions

It is just a matter of adapting what I write here for special cases.

            <exception postag='SENT_START'/>
          </token>

          <token postag='V.+' postag_regexp='yes'>
            <exception postag_regexp='yes' postag='AQ.+|NP.+|VMIP3S0|VMM02S0'/>
            <exception regexp='yes'>concertos?</exception>
          </token>

          <token postag='V.+' postag_regexp='yes'>
            <exception postag_regexp='yes' postag='AQ.+|NP.+|VMIP3S0|VMM02S0'/>
          </token>

          <exception regexp='yes'>concertos?</exception>
          <exception>concerto</exception>

          <marker>
            <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'>
            <exception regexp='yes'>.*[àáãâèéêìíîòóõôùúû]</exception>
          </token>      

<token min="0" max="1" postag='NP.+|AQ.+|NC.+' postag_regexp='yes'/>

<token>tokenfalsafalso</token>
<token>vida</token>
<token regexp='yes'>[ao]s?</token>
<token regexp='yes'>dia|segunda|terça|quarta|quinta|sexta|sábado|domingo</token>

<token negate="yes">ao</token>
<token negate="yes" regexp='yes'>[ao]s?</token>
<token negate_pos="yes" postag='CS|CC|RN|RG' postag_regexp='yes'/>

Avoid using the negate attribute: Jaume said it may produce false results. (I am rewriting the rules to use exceptions with scope="previous" and scope="next" instead.)
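
The word-level expressions in the snippets above can be checked the same way as the POS ones. A minimal sketch, assuming the expression has to match the whole token and leaving case handling aside; the variable names are made up for the example.

```python
import re

ARTICLE = re.compile(r"[ao]s?")                       # from <token regexp='yes'>[ao]s?</token>
ACCENTED_ENDING = re.compile(r".*[àáãâèéêìíîòóõôùúû]")  # from the exception on the verb token

# Whole-token matching: "aos" is not accepted, because "[ao]s?" covers at most two letters.
for word in ["a", "o", "as", "os", "aos"]:
    print(word, bool(ARTICLE.fullmatch(word)))

# The exception fires only for tokens that end in one of the accented vowels.
for word in ["está", "esta", "avô", "pé"]:
    print(word, bool(ACCENTED_ENDING.fullmatch(word)))
```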

Here is some important information that Jaume gave me:

Hi Marco!

You want to match this within the pattern, right? In that case, it’s

<token><match no="3"/></token> to match the fourth (!) token.

P.S.:
Caution: If you refer to the fourth token outside of the pattern, for example in the suggestion, it’s <suggestion><match no="4" .../></suggestion>...
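
Jaume's numbering rule is easy to get wrong, so here it is as a tiny Python sketch. `match_no` is a hypothetical helper that only restates what he wrote: counting starts at 0 inside the pattern and at 1 outside it.

```python
def match_no(token_position: int, inside_pattern: bool) -> int:
    """Return the value for no="..." given a 1-based token position.

    token_position: 1 for the first token of the pattern, 2 for the second, ...
    inside_pattern: True for <token><match .../></token>, False for
                    references in the suggestion.
    """
    if token_position < 1:
        raise ValueError("token positions start at 1")
    return token_position - 1 if inside_pattern else token_position


# The fourth token: no="3" inside the pattern, no="4" in the suggestion.
print(match_no(4, inside_pattern=True))
print(match_no(4, inside_pattern=False))
```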

8. Some rule types

 type="style">

 type="style" default="temp_off">

 type="style" tags="picky">

 type="style" tags="picky" default="temp_off">

 default="temp_off">

 default="off">

9. Some POS for added.txt based on Priberam words lookup

Not everything is here. Basically, if an online dictionary says that a word is, for instance, a masculine noun, you will find below an example word of the same kind with its POS tags, and you can search for variants of that example in our tagger dictionary.


adj. 2 g.
informacional | adj. 2 g.
AQ0CS0
masc. e fem. pl. de informacional
AQ0CP0

adj. 2 g. 2 núm.
unissexo | adj. 2 g. 2 núm.
AQ0CN0

adj. 2 g. s. 2 g.
budista | adj. 2 g. s. 2 g.
AQ0CS0
NCCS000
budistas | adj. 2 g. s. 2 g.
AQ0CP0
NCCP000

adj. s. f.
tomadora | adj. s. f.
AQ0FS0
NCFS000
tomadoras | fem. pl. de tomador
AQ0FP0
NCFP000

adj. s. m.
tomador | adj. s. m.
AQ0MS0
NCMS000
tomadores | masc. pl. de tomador
AQ0MP0
NCMP000

adv.
sintaticamente | adv.
RG

gerúndio de verbo transitivo/intransitivo/pronominal
bebendo | gerúndio de beber
VMG0000

fem. sing. part. pass transitivo e intransitivo
bebida | singular
VMP00SF
bebidas | plural
VMP00PF

masc. sing. part. pass transitivo e intransitivo
bebido | singular
VMP00SM
bebidos | plural
VMP00PM

prep.
por | prep.
SPS00

pron. pess. 2 g.
você | singular
PP3CS000
vocês | plural
PP3CP000

s. 2 g.
agente | s. 2 g.
NCCS000
agentes | masc. e fem. pl. de agente
NCCP000 

s. f.
garrafa | s. f.
NCFS000
garrafas | fem. pl. de garrafa
NCFP000

s. f. | s. 2 g.
segurança | s. f. | s. 2 g.
NCFS000
NCMS000
seguranças | masc. e fem. pl. de segurança
NCFP000
NCMP000

s. m.
frasco | s. m.
NCMS000
frascos | masc. pl. de frasco
NCMP000

s. m. 2 núm.
NCMN000

v. tr.
beber | v. tr
VMN0000
VMN01S0
VMN03S0
VMSF1S0
VMSF3S0

v. tr. e intr. | v. tr.
violar | v. tr. e intr. | v. tr.
VMN0000
VMSF1S0
VMSF3S0

v. tr. | v. intr. | v. pron.
desdizer | v. tr. | v. intr. | v. pron.
VMN0000
VMN01S0
VMN03S0 

v. tr. | v. pron.
reduzir | v. tr. | v. pron.
VMN0000
VMSF1S0
VMSF3S0
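
The noun and adjective tags in the list above follow a visible layout, which the following Python sketch spells out. It is inferred only from the examples in this topic (gender letters M, F, C and number letters S, P, N), covers NC and AQ tags only, and `describe` is a hypothetical helper.

```python
GENDER = {"M": "masculine", "F": "feminine", "C": "common"}
NUMBER = {"S": "singular", "P": "plural", "N": "invariable"}


def describe(postag: str):
    """Return (kind, gender, number) for NC/AQ tags, or None for anything else."""
    if postag.startswith("NC") and len(postag) >= 4:
        # e.g. NCMS000: gender is the 3rd character, number the 4th.
        kind, gender, number = "noun", postag[2], postag[3]
    elif postag.startswith("AQ") and len(postag) >= 5:
        # e.g. AQ0CS0: gender is the 4th character, number the 5th.
        kind, gender, number = "adjective", postag[3], postag[4]
    else:
        return None
    return kind, GENDER.get(gender, "unknown"), NUMBER.get(number, "unknown")


for tag in ["NCMS000", "NCFP000", "AQ0CS0", "AQ0CN0", "RG"]:
    print(tag, describe(tag))
```

This also shows why the agreement expressions in topic 4 use "NC.P.+" for nouns but "AQ..P.+" for adjectives: the adjective tags carry one extra character before the gender.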

10. Comments from Marco

Susana, this document needs a lot of revision; I will improve it as time goes by.

marcoagpinto commented 2 years ago

Updated on 2022-07-07:

  1. Changed the topics to bold;
  2. Added topic 9.