Open ricardojosehlima opened 2 years ago
I will try to fix it on Friday.
If I am unable to do it on Friday, I will do it on Monday.
🙂
Hi @marcoagpinto,
Any updates on this? Sorry, I don't want to press; maybe with so many things to do, this one got skipped.
Ahhhhhhhh… 🙁 I completely forgot it.
Sorry.
And Monday is the official release of LanguageTool, and I am feeling too much pressure to attempt to fix it before the release (the time is approaching).
Can I fix it after the official release?
Thanks!
Hello @ricardojosehlima
I have fixed the rule: https://github.com/languagetool-org/languagetool/commit/6f8665348b9fb57ebcdb32bda2836984a90477bb
However, that rule is: ID:AUXILIARY_VERB_INFINITIVE
For adding diacritics, I need to create an entirely new rule.
Do you have a good name for the rule and a good suggestion message?
I will also need the help of Jaume for the diacritics.
Here are the results of my fix: 0before.txt
Thanks!
Hi @marcoagpinto I have checked the file and it is really the case that there are two different things going on:
1-) two inflected (conjugados) verbs in sequence
The second issue is the lack of diacritics. As is, the rule is not focusing on what matters and is looking to previous elements.
Hello @ricardojosehlima
I have fixed the false positives: https://github.com/languagetool-org/languagetool/commit/c72cda1f6ef0788f25f64773455c495ba4fa91e4
At 5am, I will start working on the new rule.
Hi, great!
Some cases with 'a' are still being captured:
para continuar a guardá-Ia; etc
No começo ele a odiava, mas aos poucos acabou a amando.
And for some reason, these false positives remain:
Oito novos blindados devem reforçar operações em f...
Quando o marido falecer, ela irá remover a joia permanentemente.
Em vez de dar para as demandas, você deve remover Netflix Ransomware.
Não querendo refazer, poderá utilizar outro documento oficial com foto.
ahhhhhhhh... I will fix it at 5am.
I am uncertain if I will be able to fix: "para continuar a guardá-Ia; e"
"guardá" is supposed to not have a POS because of the accent, but I may be wrong.
Ok, even without POS could a regex work? \w+[aeio]-l[oa]s?
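A quick way to see what the proposed regex would and would not catch is to try it on a few forms. This is only a sketch in Python, not how LanguageTool evaluates token patterns; the sample words are chosen for illustration. One caveat: Python's `\w` is Unicode-aware, while Java's `\w` is ASCII-only by default, so accented stems may behave differently inside a Java regex.

```python
import re

# Regex from the comment above: a word ending in an unaccented a, e, i or o,
# followed by a hyphen and the clitic lo/la/los/las.
ENCLISIS = re.compile(r"\w+[aeio]-l[oa]s?")

# Illustrative samples: two forms missing the accent, two already accented.
samples = ["escreve-lo", "guarda-las", "escrevê-lo", "pô-lo"]

# fullmatch() requires the whole string to fit the pattern.
matches = {s: bool(ENCLISIS.fullmatch(s)) for s in samples}

for word, hit in matches.items():
    print(f"{word}: {'match' if hit else 'no match'}")
```

Forms that already carry the accent do not match, because the vowel before the hyphen must be one of the plain letters in `[aeio]`.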
I fixed it here: https://github.com/languagetool-org/languagetool/commit/4cfbb941c1e4dc6d71af52ae0e3da9458f647c59
"refazer" had a POS missing, I fixed it here: https://github.com/languagetool-org/languagetool/commit/801d6d99f694ffc7a3da139fa1e86d146ccb7ac4
Ahhhhh… tons of false positives removed: https://github.com/languagetool-org/languagetool/commit/af16830648774a69975e496b5c1e26bd98794aff
I am too stressed to look at the .txt, I have only been seeing the diffs.
Can you spot more issues in the .txt?
Thanks!
Hi, Good to see the rule improving! Great work (again)
I found some things worth mentioning:
1-) the suggestions still include many options that aren't adequate: pudesse fazer; pudesse a fazer; pudesse fazendo; pudesse, faria. IMO, only the first would be sufficient.
2-) There are many cases in which the problem is not as in "ele pode encontra pessoas" ==> "ele pode encontrar pessoas", but the lack of a comma between the two inflected verbs, as in the sample below:
No início vemos pessoas praticamente implorando por seguidores ou acessos, e quando conseguem somem e ignoram o leitor.
...to do resultado do tratamento, o atendimento foi excelente... o estabelecimento de fácil acesso,Vou voltar recomendo.
Claro que se ele quisesse podia ter melhores resultados, mas ele contenta-se apenas com o necessário, tenho pena, que não tenha mai...
Estamos embarcando hoje, 04/01/2015, de Belo Horizonte, para Buenos Aires e amanhã se Deus quiser vamos para o Ushuaia.
My suggestion is that the message should include after "Verbos auxiliares devem ser seguidos de formas verbais no infinitivo ou no gerúndio.": "se esse não é o caso, verifique se está faltando uma vírgula entre os dois verbos."
3-) Finally, some real false positives:
E agora que já saiu pretendo reler esse é ler os outros dois.
Pode remediar-se.
Assim, vou pô-lo à prova, para ver se anda, ou não, segundo a minha lei.
Hello!
I will do it at 5am as usual.
Tomorrow is the release day for the English/British dictionaries.
I must dedicate a lot of time to the task.
Hello @ricardojosehlima
I have fixed the rule: https://github.com/languagetool-org/languagetool/commit/252ef5193d5fb16e00170dde061c511d534309fd
The three false positives were due to missing or incorrect POSes: https://github.com/languagetool-org/languagetool/commit/eb17ca48c6c93a228cbf6471116fe7d18c9d99a3 https://github.com/languagetool-org/languagetool/commit/59dd8038f2f7c065483c37e9a81881d10724b7d4
Here are the results, the first one with the changed suggestion and the second one with the POS fixes to make it easier to see the diff: 6new_suggestions.txt
Great!
@jaumeortola
Hello!
Could you help with this rule?:
<!-- ESCREVE-LO escrevê-lo -->
<rule id='ACENTUAÇÃO_VOGAL_ÊNCLISE' name="Acentuação vogal ênclise">
<!-- Created by Marco A.G.Pinto and Jaume Ortolà with Ricardo Joseh Lima suggestions, Portuguese rule 2022-04-05 (1-JAN-2022+) -->
<!--
Quero escreve-lo amanhã. → Quero escrevê-lo amanhã.
-->
<pattern>
<token postag='V.+' postag_regexp='yes'/>
<marker>
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>
</marker>
<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
<token regexp='yes' spacebefore='no'>l[ao]s?</token>
</pattern>
<filter class="org.languagetool.rules.pt.ConfusionCheckFilter" args="form:\2 postag:[AN].*"/>
<message>Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada</message>
<suggestion>{suggestion}</suggestion>
<example correction="escrevê">Ele vai <marker>escreve</marker>-lo amanhã.</example>
</rule>
Running TESTRULES PT produces:
Testing rule 2800...
Skipped 0 rules for variant language to avoid checking rules more than once
2824 rules tested.
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule ACENTUAÇÃO_VOGAL_ÊNCLISE[1] in file /org/languagetool/rules/pt/grammar.xml: "Ele vai escreve-lo amanhã."
Errors expected: 1
Errors found : 0
Message: Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada
Analyzed token readings: [/SENT_START*] Ele[ele/PP3MS000*] [ /null*] vai[ir/VMIP3S0,ir/VMM02S0] [ /null*] escreve[escrever/VMIP3S0,escrever/VMM02S0] -[-/_PUNCT*] lo[o/PP3MSA00*] [ /null*] amanhã[amanhã/NCMS000,amanhã/RG] .[./SENT_END*,./_PUNCT*]
Matches: []
at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:330)
at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:466)
at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:365)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Running disambiguator rule tests...
Running disambiguation tests for Portuguese...
I based it on:
<rulegroup id="DIACRITICS" name="Confusão com diacríticos">
Thanks!
So, you want to replace escreve-lo with escrevê-lo. The filter doesn't seem useful here.
Verbs with accent (infinitives?) are tagged this way:
escová escovar VMX0000
escozicá escozicar VMX0000
escoá escoar VMX0000
escravizá escravizar VMX0000
escrevinhá escrevinhar VMX0000
escrevê escrever VMX0000
escriturá escriturar VMX0000
escrivá escrivar VMX0000
escrunchá escrunchar VMX0000
escrupulizá escrupulizar VMX0000
escrutiná escrutinar VMX0000
This seems to work:
<!-- ESCREVE-LO escrevê-lo -->
<rule id='ACENTUAÇÃO_VOGAL_ÊNCLISE' name="Acentuação vogal ênclise" default="temp_off" >
<!-- Created by Marco A.G.Pinto and Jaume Ortolà with Ricardo Joseh Lima suggestions, Portuguese rule 2022-04-05 (1-JAN-2022+) -->
<!--
Quero escreve-lo amanhã. → Quero escrevê-lo amanhã.
-->
<pattern>
<token postag='V.+' postag_regexp='yes'/>
<marker>
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>
</marker>
<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
<token regexp='yes' spacebefore='no'>l[ao]s?</token>
</pattern>
<message>Quando a ênclise é formado por 'la', 'las', 'lo' ou 'los', a vogal que a precede, antes do hífen, é acentuada</message>
<suggestion><match no="2" postag="VMIP3S0|VMM02S0" postag_regexp="yes" postag_replace="VMX0000"/></suggestion>
<example correction="escrevê">Ele vai <marker>escreve</marker>-lo amanhã.</example>
</rule>
@jaumeortola
Thank you, it is working, but there are some false positives.
I fixed them, but I only checked with the sentences provided by LanguageTool.
How do I make the exception check if a verb has punctuation at the end?
<marker>
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'>
<exception regexp='yes'>[aeiou]?[àáãèéìíòóõùú]</exception>
<!--
<exception regexp='yes'>crê|dá|fá|lê|prevê|revê|sê|trá|vê</exception>
-->
</token>
</marker>
That would make it work 100% with all verbs, and not only with those we have as tests.
Thanks!
This?
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes' regexp="yes">.*[ea]</token>
Or this?
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'><exception regexp="yes">.*[áê]</exception></token>
@jaumeortola
It produces tons of false positives:
<marker>
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'>
<exception regexp='yes'>.*[àáãèéìíòóõùú]</exception>
<!--
<exception regexp='yes'>crê|dá|fá|lê|prevê|revê|sê|trá|vê</exception>
-->
</token>
</marker>
@jaumeortola
ahhhh.... sorry... forgot "ê"... let me try again.
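To illustrate why the missing "ê" matters, here is a small Python sketch comparing the character class as posted with an extended one. The extended class (adding â, ê, ô) is my assumption, not the class used in the final commit, and the sketch assumes the exception regex is matched against the whole token.

```python
import re

# Character class exactly as posted above: no circumflex vowels.
POSTED = re.compile(r".*[àáãèéìíòóõùú]")

# Assumed extension for illustration: the same class plus â, ê and ô.
EXTENDED = re.compile(r".*[àáâãèéêìíòóôõùú]")


def is_excepted(pattern, token):
    """Return True if the whole token ends in a vowel from the pattern's class."""
    return pattern.fullmatch(token) is not None


for token in ["escreve", "escrevê", "dá"]:
    print(token, is_excepted(POSTED, token), is_excepted(EXTENDED, token))
```

With the posted class, "escrevê" is not excepted, so an already-correct form could still be flagged.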
@jaumeortola @ricardojosehlima
The rule has been created: https://github.com/languagetool-org/languagetool/commit/9bc23d54ab651e3eb782b3ffc76d3a3713f6db18
Here are the results: 4enclise.txt
Some verbs may not have POS, so the suggestions don't work with all of them.
Have you found any verb without the right POS tags? They can be added.
@jaumeortola
Yes, tons of them.
In the .txt above, all the ones whose suggestion is between "( )" or "[ ]" (too stressed to remember).
🙂
@jaumeortola @marcoagpinto Great work!! So many situations that until now were not seen by LT, and now, they are! As for the verbs that Marco mentioned, they are: (retruca) (retira) (reconhece) (contata) (reanima) (replica)
I scanned the file and nothing more drew my attention.
Thank you for the list of verbs. The verbs starting with re- are not in the tagger dictionary, but they get tagged because they are interpreted as being another verb with the prefix re-. It is as if someone has actively removed these verbs from the dictionary. We need to fix it. As for "contata", the verb "contatar" is there, but not the form "contatá VMX0000".
@jaumeortola so maybe it is worth testing the rule against verbs that start with 're-' to check if there are more similar cases? If so, I volunteer. I only need to know where to test it: online would be better, but I can copy/paste the proposed rule in the grammar.xml here in my LO
Thank you, @ricardojosehlima. I have extracted 5,333 verbs starting with re- from all the spelling dictionaries (PT, BR, AO, MZ). What do you think? I guess there are too many verbs. Some are probably unusual and absent from common dictionaries. Only 264 are in the tagger dictionary, and a few more (5-6) are in added.txt. Some rere- verbs are most likely generated with two prefixes (rereconsiderar).
verbs-re-in-spelling-dicts.txt verbs-re-in-tagger-dict.txt
Once we decide which verbs are valid, I will add the missing ones with the whole conjugation.
Could you take a look? How long would it take to check these 5,000 verbs? We need to remove the verbs that are wrong or very rare.
@jaumeortola so, if I understood it well, I would search for valid verbs in spelling.txt and then those would be added to tagger.txt, correct? I can do this to very frequent verbs that I can spot are not in tagger.txt, but not for valid verbs as there are verbs included from PT, AO and MZ and I don't know for example if 'recapar' is even valid in PT, AO and MZ or if it is frequent. I can make a list that could be a consensus between PT, BR, AO and MZ as both valid and frequent. As for how long I will take, I may take some days to finish the spelling.txt
Yes, verbs in verbs-re-in-spelling-dicts.txt, once confirmed, will be added to added.txt (with the whole conjugation, around 85 forms for each verb).
But before doing anything, I will compare the differences among the spelling dictionaries and will provide several lists (one for verbs in all dictionaries, another for only BR, and so on), so that we can set priorities. I will post them here.
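For what it's worth, the comparison could be sketched roughly like this in Python. The function name and the toy verb sets are made up for illustration; the real lists would come from the spelling dictionaries, and the actual extraction may well be done differently.

```python
def group_by_coverage(dicts):
    """Split verbs into those found in every dictionary and those unique to one.

    `dicts` maps a variant code (e.g. "PT") to the set of verbs found in
    that variant's spelling dictionary.
    """
    in_all = set.intersection(*dicts.values())
    only_in = {}
    for variant, verbs in dicts.items():
        # Union of every other variant's verbs.
        others = set().union(*(v for k, v in dicts.items() if k != variant))
        only_in[variant] = verbs - others
    return in_all, only_in


# Toy data, invented for illustration; not the real dictionary contents.
toy = {
    "PT": {"reconsiderar", "refazer", "registar"},
    "BR": {"reconsiderar", "refazer", "registrar", "recapar"},
    "AO": {"reconsiderar", "refazer"},
    "MZ": {"reconsiderar", "refazer"},
}

in_all, only_in = group_by_coverage(toy)
print("in all dictionaries:", sorted(in_all))
for variant, verbs in only_in.items():
    print(f"only in {variant}:", sorted(verbs))
```

Verbs present in every dictionary would be the natural first priority, with the single-variant lists reviewed afterwards.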
@jaumeortola @marcoagpinto I am doing the dictionary task, and indeed the tagger file lacks lots of frequent verbs with re-. However, a question has arisen: verbs that are not in the tagger file, for example 'registrar', are tagged correctly at https://community.languagetool.org/analysis/analyzeText, just like any other verb. So what problem does their absence from the tagger cause?
@ricardojosehlima
Hello!
They need to be in spelling.txt or in the corresponding .dic file.
Ahhhh… @ricardojosehlima
The speller.txt only has words common to all variants.
For specific variants the words must be added to .dics or maybe @jaumeortola knows a better way.
🙂
When writing "Quero escreve-los amanhã", LT does notice the absence of the circumflex in "escreve", however its suggestions are wrong: "quero escrever", "quero a escrever", "quero escrevendo", "quero, escreve".
My guess is that the rule is not looking at the hyphen and thus is only applying to "quero escreve" which should be "quero escrever", which is the first suggestion and in my view if there is no hyphen following it, should be the only suggestion.
As for the case with the hyphen, it simply needs to add the correct diacritic depending on the vowel preceding it (á, ê, í, ô).
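As a rough string-level illustration of that idea (a Python sketch only; a real LanguageTool rule would work on tokens and POS tags, and the function name here is made up), the vowel before the hyphen can be mapped to its accented counterpart:

```python
import re

# Mapping taken from the description above: a→á, e→ê, i→í, o→ô.
ACCENTED = {"a": "á", "e": "ê", "i": "í", "o": "ô"}

# Stem, final unaccented vowel, then hyphen plus lo/la/los/las.
PATTERN = re.compile(r"^(\w*)([aeio])(-l[ao]s?)$")


def accent_before_enclitic(word):
    """Accent the vowel before -lo/-la/-los/-las; return other words unchanged."""
    m = PATTERN.match(word)
    if m is None:
        return word
    stem, vowel, clitic = m.groups()
    return stem + ACCENTED[vowel] + clitic


for w in ["escreve-los", "ama-la", "escrevê-lo", "escreve"]:
    print(w, "->", accent_before_enclitic(w))
```

Words that already have the accent, or that have no enclitic, pass through untouched.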