languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] Wrong correction for absence of diacritic #6447

Open ricardojosehlima opened 2 years ago

ricardojosehlima commented 2 years ago

When writing "Quero escreve-los amanhã", LT does notice the absence of the circumflex in "escreve", however its suggestions are wrong: "quero escrever", "quero a escrever", "quero escrevendo", "quero, escreve".

My guess is that the rule ignores the hyphen and therefore applies only to "quero escreve", which should be "quero escrever". That is the first suggestion and, in my view, it should be the only one when no hyphen follows.

As for the case with the hyphen, the rule simply needs to add the correct diacritic to the vowel preceding the hyphen (á, ê, í, ô).
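
A minimal Python sketch of the mapping described above, assuming the verb form and the clitic are joined by a plain hyphen. The function name `accent_before_clitic` is hypothetical, and the sketch covers only a, e and o; í is left out as a simplifying assumption. It does no part-of-speech check, so it only illustrates the string transformation, not a complete rule.

```python
import re

# Accent added to the final vowel of the verb form, following the list in the
# comment above ("í" is left out as a simplifying assumption).
ACCENTED = {"a": "á", "e": "ê", "o": "ô"}

# Verb form ending in an unaccented vowel, a hyphen, then lo/la/los/las.
ENCLISIS = re.compile(r"^(\w*)([aeo])-(l[ao]s?)$")


def accent_before_clitic(word: str) -> str:
    """Return the word with the vowel before the hyphen accented, if it applies."""
    match = ENCLISIS.match(word)
    if match is None:
        return word  # not an enclisis candidate; leave unchanged
    stem, vowel, clitic = match.groups()
    return f"{stem}{ACCENTED[vowel]}-{clitic}"


print(accent_before_clitic("escreve-lo"))  # escrevê-lo
```

Forms that already carry the accent, such as "guardá-la", are returned unchanged because the vowel before the hyphen is not in the unaccented set.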

marcoagpinto commented 2 years ago

I will try to fix it on Friday.

If I am unable to do it on Friday, I will do it on Monday.

🙂

ricardojosehlima commented 2 years ago

Hi @marcoagpinto,

Any updates on this? Sorry, I don't want to press; maybe with so many things to do, this one got skipped.

marcoagpinto commented 2 years ago

Ahhhhhhhh… 🙁 I completely forgot it.

Sorry.

And Monday is the official release of LanguageTool, and I am feeling too much pressure to attempt to fix it before the release (the time is approaching).

Can I fix it after the official release?

Thanks!

marcoagpinto commented 2 years ago

Hello @ricardojosehlima

I have fixed the rule: https://github.com/languagetool-org/languagetool/commit/6f8665348b9fb57ebcdb32bda2836984a90477bb

However, that rule is: ID:AUXILIARY_VERB_INFINITIVE

For adding diacritics, I need to create an entirely new rule.

Do you have a good name for the rule and a good suggestion message?

I will also need the help of Jaume for the diacritics.

Here are the results of my fix: 0before.txt

1after.txt

Thanks!

ricardojosehlima commented 2 years ago

Hi @marcoagpinto I have checked the file and it is really the case that there are two different things going on:

1-) two inflected (conjugados) verbs in sequence

2-) the lack of diacritics. As it stands, the rule is not focusing on what matters and is looking at previous elements.

marcoagpinto commented 2 years ago

Hello @ricardojosehlima

I have fixed the false positives: https://github.com/languagetool-org/languagetool/commit/c72cda1f6ef0788f25f64773455c495ba4fa91e4

2.txt

At 5am, I will start working on the new rule.

ricardojosehlima commented 2 years ago

Hi, great!

Some cases with 'a' are still being captured: para continuar a guardá-Ia; etc No começo ele a odiava, mas aos poucos acabou a amando.

And for some reason these false positives remained:

Oito novos blindados devem reforçar operações em f...

Quando o marido falecer, ela irá remover a joia permanentemente.

Em vez de dar para as demandas, você deve remover Netflix Ransomware.

Não querendo refazer, poderá utilizar outro documento oficial com foto.

marcoagpinto commented 2 years ago

ahhhhhhhh... I will fix it at 5am.

I am uncertain if I will be able to fix: "para continuar a guardá-Ia; e"

"guardá" is supposed to not have a POS because of the accent, but I may be wrong.

ricardojosehlima commented 2 years ago

Ok, even without POS could a regex work? \w+[aeio]-l[oa]s?
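
A quick check of the proposed regex, sketched with Python's `re` module (LanguageTool itself uses Java regexes, so this is only an approximation; the helper name `is_candidate` is hypothetical). Because the class `[aeio]` contains only unaccented vowels, the expression finds candidates such as "escreve-lo" but not an already accented form such as "guardá-la".

```python
import re

# Regex proposed in the comment above, applied to a whole word.
PROPOSED = re.compile(r"\w+[aeio]-l[oa]s?")


def is_candidate(word: str) -> bool:
    """True if the whole word matches the proposed pattern."""
    return PROPOSED.fullmatch(word) is not None


for word in ["escreve-lo", "encontra-las", "guardá-la", "escrever"]:
    print(word, is_candidate(word))
```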

marcoagpinto commented 2 years ago

> Ok, even without POS could a regex work? \w+[aeio]-l[oa]s?

I fixed it here: https://github.com/languagetool-org/languagetool/commit/4cfbb941c1e4dc6d71af52ae0e3da9458f647c59

marcoagpinto commented 2 years ago

"refazer" had a POS missing, I fixed it here: https://github.com/languagetool-org/languagetool/commit/801d6d99f694ffc7a3da139fa1e86d146ccb7ac4

marcoagpinto commented 2 years ago

Ahhhhh… tons of false positives removed: https://github.com/languagetool-org/languagetool/commit/af16830648774a69975e496b5c1e26bd98794aff

5.txt

I am too stressed to look at the .txt, I have only been seeing the diffs.

Can you spot more issues in the .txt?

Thanks!

ricardojosehlima commented 2 years ago

Hi, Good to see the rule improving! Great work (again)

I found some things worth mentioning:

1-) The suggestions still include many options that aren't adequate: "pudesse fazer", "pudesse a fazer", "pudesse fazendo", "pudesse, faria". IMO, only the first would be sufficient.

2-) There are many cases in which the problem is not as in "ele pode encontra pessoas" ==> "ele pode encontrar pessoas", but the lack of a comma between the two inflected verbs, as in the sample below:

No início vemos pessoas praticamente implorando por seguidores ou acessos, e quando conseguem somem e ignoram o leitor.

...to do resultado do tratamento, o atendimento foi excelente... o estabelecimento de fácil acesso,Vou voltar recomendo.

Claro que se ele quisesse podia ter melhores resultados, mas ele contenta-se apenas com o necessário, tenho pena, que não tenha mai...

Estamos embarcando hoje, 04/01/2015, de Belo Horizonte, para Buenos Aires e amanhã se Deus quiser vamos para o Ushuaia.

My suggestion is that the message should include after "Verbos auxiliares devem ser seguidos de formas verbais no infinitivo ou no gerúndio.": "se esse não é o caso, verifique se está faltando uma vírgula entre os dois verbos."

3-) Finally, some real false positives:

E agora que já saiu pretendo reler esse é ler os outros dois.

Pode remediar-se.

Assim, vou pô-lo à prova, para ver se anda, ou não, segundo a minha lei.

marcoagpinto commented 2 years ago

Hello!

I will do it at 5am as usual.

Tomorrow is the release day for the English/British dictionaries.

I must dedicate a lot of time to the task.

marcoagpinto commented 2 years ago

Hello @ricardojosehlima

I have fixed the rule: https://github.com/languagetool-org/languagetool/commit/252ef5193d5fb16e00170dde061c511d534309fd

The three false positives were due to missing or incorrect POSes: https://github.com/languagetool-org/languagetool/commit/eb17ca48c6c93a228cbf6471116fe7d18c9d99a3 https://github.com/languagetool-org/languagetool/commit/59dd8038f2f7c065483c37e9a81881d10724b7d4

Here are the results, the first one with the changed suggestion and the second one with the POS fixes to make it easier to see the diff: 6new_suggestions.txt

7.txt

ricardojosehlima commented 2 years ago

Great!

marcoagpinto commented 2 years ago

@jaumeortola

Hello!

Could you help with this rule?:

    <!-- ESCREVE-LO escrevê-lo -->
    <rule id='ACENTUAÇÃO_VOGAL_ÊNCLISE' name="Acentuação vogal ênclise">
    <!--      Created by Marco A.G.Pinto and Jaume Ortolà with Ricardo Joseh Lima suggestions, Portuguese rule 2022-04-05 (1-JAN-2022+)      -->
    <!--
Quero escreve-lo amanhã. → Quero escrevê-lo amanhã.
    --> 
      <pattern>
          <token postag='V.+' postag_regexp='yes'/>
          <marker>
            <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>       
          </marker>
          <token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
          <token regexp='yes' spacebefore='no'>l[ao]s?</token>
      </pattern>
      <filter class="org.languagetool.rules.pt.ConfusionCheckFilter" args="form:\2 postag:[AN].*"/>
      <message>Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada</message>
      <suggestion>{suggestion}</suggestion>
      <example correction="escrevê">Ele vai <marker>escreve</marker>-lo amanhã.</example>
    </rule>

A TESTRULES PT run produces:

Testing rule 2800...
Skipped 0 rules for variant language to avoid checking rules more than once
2824 rules tested.
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule ACENTUAÇÃO_VOGAL_ÊNCLISE[1] in file /org/languagetool/rules/pt/grammar.xml: "Ele vai escreve-lo amanhã."
Errors expected: 1
Errors found   : 0
Message: Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada
Analyzed token readings: [/SENT_START*] Ele[ele/PP3MS000*]  [ /null*] vai[ir/VMIP3S0,ir/VMM02S0]  [ /null*] escreve[escrever/VMIP3S0,escrever/VMM02S0] -[-/_PUNCT*] lo[o/PP3MSA00*]  [ /null*] amanhã[amanhã/NCMS000,amanhã/RG] .[./SENT_END*,./_PUNCT*]
Matches: []
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:330)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:466)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:365)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...

I based it on: <rulegroup id="DIACRITICS" name="Confusão com diacríticos">

Thanks!

jaumeortola commented 2 years ago

So, you want to replace escreve-lo with escrevê-lo. The filter doesn't seem useful here.

Verbs with accent (infinitives?) are tagged this way:

escová  escovar VMX0000
escozicá    escozicar   VMX0000
escoá   escoar  VMX0000
escravizá   escravizar  VMX0000
escrevinhá  escrevinhar VMX0000
escrevê escrever    VMX0000
escriturá   escriturar  VMX0000
escrivá escrivar    VMX0000
escrunchá   escrunchar  VMX0000
escrupulizá escrupulizar    VMX0000
escrutiná   escrutinar  VMX0000

This seems to work:


   <!-- ESCREVE-LO escrevê-lo -->
   <rule id='ACENTUAÇÃO_VOGAL_ÊNCLISE' name="Acentuação vogal ênclise" default="temp_off" >
     <!--      Created by Marco A.G.Pinto and Jaume Ortolà with Ricardo Joseh Lima suggestions, Portuguese rule 2022-04-05 (1-JAN-2022+)      -->
     <!--
Quero escreve-lo amanhã. → Quero escrevê-lo amanhã.
    --> 
     <pattern>
       <token postag='V.+' postag_regexp='yes'/>
       <marker>
         <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>      
       </marker>
       <token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
       <token regexp='yes' spacebefore='no'>l[ao]s?</token>
     </pattern>
     <message>Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada</message>
     <suggestion><match no="2" postag="VMIP3S0|VMM02S0" postag_regexp="yes" postag_replace="VMX0000"/></suggestion>
     <example correction="escrevê">Ele vai <marker>escreve</marker>-lo amanhã.</example>
   </rule>
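
The `<match ... postag_replace="VMX0000"/>` suggestion can be pictured as a dictionary lookup: take the lemma of the matched token and find the form of that lemma tagged VMX0000. The Python sketch below only illustrates that idea with a few entries copied from the listing above; the `synthesize` helper and the in-memory table are hypothetical and are not LanguageTool's actual synthesizer.

```python
# (form, lemma, tag) entries copied from the tagger listing above.
ENTRIES = [
    ("escová", "escovar", "VMX0000"),
    ("escravizá", "escravizar", "VMX0000"),
    ("escrevê", "escrever", "VMX0000"),
]

# Index by (lemma, tag) so a form can be looked up from its lemma.
FORMS = {(lemma, tag): form for form, lemma, tag in ENTRIES}


def synthesize(lemma, tag="VMX0000"):
    """Return the form of `lemma` with the given tag, or None if the table lacks it."""
    return FORMS.get((lemma, tag))


print(synthesize("escrever"))  # escrevê
print(synthesize("escovar"))   # escová
```

Under this picture, a lemma with no VMX0000 entry yields `None`, so no suggestion could be produced for it.
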

marcoagpinto commented 2 years ago

@jaumeortola

Thank you, it is working, but there are some false positives.

I fixed them, but I only checked with the sentences provided by LanguageTool.

How do I make the exception check whether a verb has an accented vowel (diacritic) at the end?

          <marker>
            <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'>
            <exception regexp='yes'>[aeiou]?[àáãèéìíòóõùú]</exception>
<!--            
            <exception regexp='yes'>crê|dá|fá|lê|prevê|revê|sê|trá|vê</exception>
-->
          </token>      
          </marker>

That would make it work 100% with all verbs, and not only with those we have as tests.

Thanks!

jaumeortola commented 2 years ago

This?

    <token postag='VMIP3S0|VMM02S0' postag_regexp='yes' regexp="yes">.*[ea]</token>

Or this?

    <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'><exception regexp="yes">.*[áê]</exception></token>
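
The difference between the earlier exception `[aeiou]?[àáãèéìíòóõùú]` and the `.*`-prefixed variants can be sketched in Python, assuming (as the `.*` prefix suggests) that a token regex has to match the whole token. `re.fullmatch` stands in for that behaviour here; LanguageTool itself uses Java regexes, and the helper name `token_matches` is hypothetical.

```python
import re


def token_matches(pattern: str, token: str) -> bool:
    """Whole-token match, used here as a stand-in for a LanguageTool token regex."""
    return re.fullmatch(pattern, token) is not None


EARLIER = r"[aeiou]?[àáãèéìíòóõùú]"  # exception from the earlier snippet
ENDS_ACCENTED = r".*[áê]"            # exception variant with a .* prefix
ENDS_PLAIN = r".*[ea]"               # positive filter variant with a .* prefix

for token in ["escreve", "escrevê", "dá"]:
    print(
        token,
        token_matches(EARLIER, token),
        token_matches(ENDS_ACCENTED, token),
        token_matches(ENDS_PLAIN, token),
    )
```

Under the whole-token assumption, the earlier exception matches none of the three sample tokens, while `.*[áê]` matches both accented forms and `.*[ea]` matches only the unaccented "escreve".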

marcoagpinto commented 2 years ago

@jaumeortola

It produces tons of false positives:

          <marker>
            <token postag='VMIP3S0|VMM02S0' postag_regexp='yes'>
            <exception regexp='yes'>.*[àáãèéìíòóõùú]</exception>
<!--            
            <exception regexp='yes'>crê|dá|fá|lê|prevê|revê|sê|trá|vê</exception>
-->
          </token>      
          </marker>

marcoagpinto commented 2 years ago

@jaumeortola

ahhhh.... sorry... forgot "ê"... let me try again.

marcoagpinto commented 2 years ago

@jaumeortola @ricardojosehlima

The rule has been created: https://github.com/languagetool-org/languagetool/commit/9bc23d54ab651e3eb782b3ffc76d3a3713f6db18

Here are the results: 4enclise.txt

Some verbs may not have POS, so the suggestions don't work with all of them.

jaumeortola commented 2 years ago

> Some verbs may not have POS, so the suggestions don't work with all of them.

Have you found any verb without the right POS tags? They can be added.

marcoagpinto commented 2 years ago

@jaumeortola

Yes, tons of them.

In the .txt above, all the ones whose suggestion is between "( )" or "[ ]" (too stressed to remember).

🙂

ricardojosehlima commented 2 years ago

@jaumeortola @marcoagpinto Great work!! So many situations that until now were not seen by LT, and now, they are! As for the verbs that Marco mentioned, they are: (retruca) (retira) (reconhece) (contata) (reanima) (replica)

I scanned the file and nothing more drew my attention.

jaumeortola commented 2 years ago

Thank you for the list of verbs. The verbs starting with re- are not in the tagger dictionary, but they get tagged because they are interpreted as being another verb with the prefix re-. It is as if someone has actively removed these verbs from the dictionary. We need to fix it. As for "contata", the verb "contatar" is there, but not the form "contatá VMX0000".
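
The behaviour described here can be pictured with a small Python sketch: a word missing from the dictionary is still tagged if stripping "re" leaves a known verb form, while a form that is simply absent gets no reading. The `tag` helper and its tiny table are hypothetical illustrations, not LanguageTool's tagger; the entry for "tira" is an assumption modelled on the tags shown for "escreve" in the test output earlier in the thread.

```python
# Hypothetical mini tagger dictionary: form -> list of (lemma, tag).
# The "tira" entry is an assumption for illustration only.
DICTIONARY = {
    "tira": [("tirar", "VMIP3S0")],
    "escrevê": [("escrever", "VMX0000")],
}


def tag(word):
    """Look the word up directly, then retry without a leading 're' prefix."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    if word.startswith("re") and word[2:] in DICTIONARY:
        # Reuse the base verb's tag and prefix its lemma, e.g. tirar -> retirar.
        return [("re" + lemma, postag) for lemma, postag in DICTIONARY[word[2:]]]
    return []


print(tag("retira"))   # tagged through the prefix fallback
print(tag("contatá"))  # no reading: the form is absent from the table
```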

ricardojosehlima commented 2 years ago

@jaumeortola so maybe it is worth testing the rule against verbs that start with 're-' to check if there are more similar cases? If so, I volunteer. I only need to know where to test it: online would be better, but I can copy/paste the proposed rule in the grammar.xml here in my LO

jaumeortola commented 2 years ago

> so maybe it is worth testing the rule against verbs that start with 're-' to check if there are more similar cases? If so, I volunteer.

Thank you, @ricardojosehlima. I have extracted 5,333 verbs starting with re- from all the spelling dictionaries (PT, BR, AO, MZ). What do you think? I guess there are too many verbs. Some are probably unusual and absent from common dictionaries. Only 264 are in the tagger dictionary, and a few more (5-6) are in added.txt. Some rere- verbs are most likely generated with two prefixes (rereconsiderar).

verbs-re-in-spelling-dicts.txt verbs-re-in-tagger-dict.txt

Once we decide which verbs are valid, I will add the missing ones with the whole conjugation.

Could you take a look? How long would it take to check these 5,000 verbs? We need to remove the verbs that are wrong or very rare.
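
Finding the verbs that are in the spelling lists but not yet in the tagger list is a set difference. The Python sketch below assumes one verb per line in each attached file; the helper names and the sample lists are made up for illustration and are not taken from the actual dictionaries.

```python
def read_verbs(path):
    """Read one verb per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_from_tagger(spelling_verbs, tagger_verbs):
    """Return verbs found in the spelling lists but absent from the tagger list."""
    return sorted(set(spelling_verbs) - set(tagger_verbs))


# Tiny in-memory sample; with the attachments one would instead call
# read_verbs("verbs-re-in-spelling-dicts.txt") and
# read_verbs("verbs-re-in-tagger-dict.txt").
spelling = ["reanimar", "reconhecer", "rereconsiderar", "retirar"]
tagger = ["reconhecer"]
print(missing_from_tagger(spelling, tagger))
```

The resulting list would still need the manual review requested above, since it says nothing about which verbs are wrong or very rare.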

ricardojosehlima commented 2 years ago

@jaumeortola so, if I understood correctly, I would search for valid verbs in spelling.txt and then those would be added to tagger.txt, correct? I can do this for very frequent verbs that I can spot are not in tagger.txt, but not for valid verbs in general, since the list includes verbs from PT, AO and MZ and I don't know, for example, whether 'recapar' is even valid in PT, AO and MZ or whether it is frequent. I can make a list that could be a consensus between PT, BR, AO and MZ as both valid and frequent. As for timing, it may take me some days to finish the spelling.txt.

jaumeortola commented 2 years ago

Yes, verbs in verbs-re-in-spelling-dicts.txt, once confirmed, will be added to added.txt (with the whole conjugation, around 85 forms for each verb). But before doing anything, I will compare the differences among the spelling dictionaries and will provide several lists (one for verbs in all dictionaries, another for only BR, and so on), so that we can set priorities. I will post them here.
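
The comparison across dictionaries described above amounts to grouping each verb by the set of variants that contain it. A minimal Python sketch, assuming each variant's verbs are available as a set; the function name and the sample words are illustrative and not taken from the actual dictionaries.

```python
from collections import defaultdict


def group_by_variants(dictionaries):
    """Map each combination of variants to the verbs found in exactly those dictionaries."""
    variants_of = defaultdict(set)
    for variant, verbs in dictionaries.items():
        for verb in verbs:
            variants_of[verb].add(variant)
    groups = defaultdict(list)
    for verb, variants in variants_of.items():
        groups[tuple(sorted(variants))].append(verb)
    return {key: sorted(verbs) for key, verbs in groups.items()}


# Illustrative sample only.
sample = {
    "PT": {"reanimar", "registar"},
    "BR": {"reanimar", "registrar"},
    "AO": {"reanimar", "registar"},
    "MZ": {"reanimar", "registar"},
}
for variants, verbs in sorted(group_by_variants(sample).items()):
    print(variants, verbs)
```

The group keyed by all four variants corresponds to the "verbs in all dictionaries" list, and the `("BR",)` group to the "only BR" list.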

ricardojosehlima commented 2 years ago

@jaumeortola @marcoagpinto I am doing the dictionary task and indeed the tagger file lacks lots of frequent verbs with re-. However, a question has arisen: I see verbs that are not in the tagger file, for example 'registrar', being tagged correctly in https://community.languagetool.org/analysis/analyzeText just like any other verb. So what problem does their absence from the tagger cause?

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

They need to be in spelling.txt or in the corresponding .dic file.

marcoagpinto commented 2 years ago

Ahhhh… @ricardojosehlima

The spelling.txt only has words common to all variants.

For specific variants, the words must be added to the .dic files, or maybe @jaumeortola knows a better way.

🙂