languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] Proclisis correction #6063

Open ricardojosehlima opened 2 years ago

ricardojosehlima commented 2 years ago

The sentence "Nos viram na festa ontem." is receiving the following feedback: "Gramática e outros: Substitua por Nós viram", which doesn't make any sense (to me).

The sentence above starts with a weak pronoun ('pronome átono'), which goes against the rules of standard written Portuguese.

So aside from reviewing the rule above, another one should be created to flag "Me viram na festa ontem." and "Te viram na festa ontem." Here is what I came up with:

<!-- Portuguese rule, 2021-11-10 -->
<rule id="PROCLISE_COMECO_FRASE" name="Proclise_Comeco_Frase">
 <pattern case_sensitive='yes'>
  <marker>
  <token regexp='yes' postag='P.+' postag_regexp='yes'>([MTS]e|Lhe|[NV][oa]s|[OA]s)</token>
  </marker>
  <token postag='V.+' postag_regexp='yes'></token>
 </pattern>
 <message>No registro formal escrito, não se começa frase com pronome átono. Coloque o pronome após o verbo.</message>
 <example correction=''><marker>Me</marker> disseram que ele veio.</example>
 <example>Disseram-me que ele veio.</example>
</rule>

I wasn't able to indicate in the rule editor that the pronoun must be at the start of the sentence. This has to be specified, otherwise the rule will generate false alarms, as in "Quem Te Viu Quem Te Vê" (the title of a song).

Also, I wasn't able to make the message produce the corrected form: the verb first, then a hyphen (-), and then the pronoun.
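
To make the sentence-start condition concrete, here is a minimal Python sketch (not LanguageTool code). The pronoun list and the name `starts_with_clitic` are assumptions for illustration, and the check only looks at the first word; it does not do the verb POS test that the real rule needs.

```python
import re

# Hypothetical pronoun list for illustration; 'se' is left out because it is ambiguous.
CLITIC_FIRST = re.compile(r"(?:Me|Te|Lhes?|Nos|Vos)\s+\w+")

def starts_with_clitic(sentence: str) -> bool:
    """True only when a weak pronoun is the very first word of the sentence."""
    # re.match anchors at position 0, so pronouns later in the sentence are ignored.
    return CLITIC_FIRST.match(sentence.strip()) is not None

print(starts_with_clitic("Me viram na festa ontem."))  # True
print(starts_with_clitic("Quem Te Viu Quem Te Vê"))    # False
```

Because the match is anchored at the first word, the capitalised "Te" inside the song title is never considered.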

marcoagpinto commented 2 years ago

Ahhh...

What you mean is: "Nos viram na festa" should suggest "Viram-nos na festa"? And "Me disseram que ele veio" should suggest "Disseram-me que ele veio"?

It is very easy to code.

If it is as simple as that, I will code it at 5am as usual.

Thanks!

ricardojosehlima commented 2 years ago

@marcoagpinto yes, this is it! Note that I didn't use the pronoun 'se' in my rule, because it may be ambiguous. "Se pretende discutir algo" is a case for the rule (it should become "Pretende-se discutir algo"), whereas "Se pode usar casaco, eu vou usar" is not, because there 'se' is a conjunction. I would guess that the latter construction is rarer, though.

marcoagpinto commented 2 years ago

@ricardojosehlima

For now I have fixed the rule NOS_VERBO:
https://github.com/languagetool-org/languagetool/commit/d5542eb0b16795547353f9e104d0043eb12ceaa1
https://github.com/languagetool-org/languagetool/commit/f2be8cd985200d896c50372e20e77189d56e4493

And here are the results before and after:

tiago_before.txt

tiago_after.txt

As you can see in the results, the rule can suggest "nós" or "nos" depending on the context, which is why I added: <message>Baseado no contexto pode utilizar 'Nós' ou 'Nos'.</message>

Now I will rest for an hour and start working on your rule (I had to change Tiago's rule so that it doesn't affect your rule).

EDIT: Tested against 900 000 sentences.

ricardojosehlima commented 2 years ago

@marcoagpinto right! However, looking at the data, most of the matches belong to the other rule, the one you are going to build: "Nos viram" should receive "Viram-nos" as a suggestion. In that rule, suggesting verb + hyphen + pronoun ("Disseram-me", for example) will conflict with mesoclisis: "Te darei" can't receive "Darei-te" as a suggestion, because verbs in the future require mesoclisis, so it should be "Dar-te-ei". Maybe it would be enough for the message to say that mesoclisis should be used with future forms, or mesoclisis could be incorporated into the rule.
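
As a rough illustration of that distinction (a Python sketch, not LanguageTool code): it assumes the tagger marks future indicative forms with tags starting in VMIF and conditional forms with VMIC, as the POS patterns quoted in this thread suggest. The function name is made up for the example.

```python
def clitic_placement(postag: str) -> str:
    """Pick the clitic position for a sentence-initial 'pronoun + verb' pair."""
    # Assumption: VMIF = future indicative, VMIC = conditional; both take mesoclisis.
    if postag.startswith(("VMIF", "VMIC")):
        return "mesoclisis"   # e.g. "Te darei" -> "Dar-te-ei"
    return "enclisis"         # e.g. "Nos viram" -> "Viram-nos"

print(clitic_placement("VMIF1S0"))  # mesoclisis
print(clitic_placement("VMIS3P0"))  # enclisis
```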

marcoagpinto commented 2 years ago

@ricardojosehlima

I have a simple approach to "darei-te": I created a rule for it months ago ("dar-te-ei"), so I will just copy/paste from that rule once I find it in grammar.xml (shouldn't be hard).

marcoagpinto commented 2 years ago

Ahhhhhh… it is triggered by the pt-PT rule:

  <rulegroup id='HIFENIZADOR_VERBOS_2' name='Colocações pronominais dois termos'>
    <!-- Created by Tiago F. Santos , Portuguese rule, 2016-11-04 -->
    <!--        Brazilian Portuguese has some inverted colocations  -->
      <url>https://ciberduvidas.iscte-iul.pt/consultorio/perguntas/a-colocacao-dos-pronomes-atonos/11366</url>
      <short>Erro de colocação pronominal</short>

marcoagpinto commented 2 years ago

@ricardojosehlima

Antipattern created to avoid "darei-te" and the like: https://github.com/languagetool-org/languagetool/commit/d83b57c1640b0a2367082671b1fac858ea3e42ae

Now I will rest a bit more.

marcoagpinto commented 2 years ago

@ricardojosehlima

Look at this:

    <!-- PODERIAM-SE poder-se-ia -->
    <rulegroup id='PODERIAM-SE' name="Formas verbais: poderiam-se → poder-se-iam">
    <!--      Created by Marco A.G.Pinto, Portuguese rule 2021-03-20 (17-MAR-2021+)      -->
    <!--
Assim poderia-se escrever à Ana. → Assim poder-se-ia escrever à Ana.
Assim poderiam-se escrever livros sobre o assunto. → Assim poder-se-iam escrever livros sobre o assunto.
    -->
        <rule>
            <pattern>
                <token postag='VMIC[13]S0' postag_regexp='yes'/>    
                <token regexp='yes' spacebefore='no'>&hifen;</token>
                <token regexp='yes' spacebefore='no'>se|lhes?|me</token>
            </pattern>
            <message>Esta forma verbal não existe.</message>
            <suggestion><match no='1' postag="VMIC[13]S0" postag_replace='VMN0000' postag_regexp="yes"/>-\3-ia</suggestion>
            <example correction="poder-se-ia">Assim <marker>poderia-se</marker> escrever à Ana.</example>
        </rule>
        <rule>
            <pattern>
                <token postag='VMIC3P0' postag_regexp='no'/>    
                <token regexp='yes' spacebefore='no'>&hifen;</token>
                <token regexp='yes' spacebefore='no'>se|lhes?|me</token>
            </pattern>
            <message>Esta forma verbal não existe.</message>
            <suggestion><match no='1' postag="VMIC3P0" postag_replace='VMN0000' postag_regexp="yes"/>-\3-iam</suggestion>
            <example correction="poder-se-iam">Assim <marker>poderiam-se</marker> escrever livros sobre o assunto.</example>
        </rule>
    </rulegroup>

This rule doesn't accept "te" ("poder-te-iam").

I am about to improve it.

marcoagpinto commented 2 years ago

Ahhhhh... I have improved the previous rule: id='PODERIAM-SE'

https://github.com/languagetool-org/languagetool/commit/aa8246e5a8df73f626ac7370cc570982949dd4cd

Basically, the rule you suggested is a changed version of this rule.

Right now I can't focus more on code, so I will resume at 5am… sorry… 🙁

Only at 5am can I focus properly.

marcoagpinto commented 2 years ago

Improved it even more: https://github.com/languagetool-org/languagetool/commit/20cb7a3f8c2e9a62004bb1c7b123a55fc9801afa

Now only at 5am.

ricardojosehlima commented 2 years ago

Cool!

marcoagpinto commented 2 years ago

@ricardojosehlima

I have the rule working (and before 5am), but TESTRULES PT throws exceptions about the letter case of the suggestions, and I don't know how to fix them.

I have tested it on the Stand-alone tool.

@udomai @jaumeortola Could one of you help fix the case in the suggestions?:

Skipped 0 rules for variant language to avoid checking rules more than once
2791 rules tested.
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule PROCLISE_COMECO_FRASE[1] in file /org/languagetool/rules/pt/grammar.xml: Incorrect suggestions: Expected 'Disse-me', got: 'Disse-Me' on input: 'Me disse que ele veio.'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:582)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:474)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule PROCLISE_COMECO_FRASE[2] in file /org/languagetool/rules/pt/grammar.xml: Incorrect suggestions: Expected 'Poder-te-ia', got: 'Poder-Te-ia' on input: 'Te poderia escrever uma carta?'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:582)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:474)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule PROCLISE_COMECO_FRASE[3] in file /org/languagetool/rules/pt/grammar.xml: Incorrect suggestions: Expected 'Poder-nos-iam', got: 'Poder-Nos-iam' on input: 'Nos poderiam escrever uma carta?'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:582)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:474)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule PROCLISE_COMECO_FRASE[1] in file /org/languagetool/rules/pt/grammar.xml: Incorrect suggestions: Expected 'Disseram-me', got: 'Disseram-Me' on input: 'Me disseram que ele veio.'
        at org.languagetool.rules.patterns.PatternRuleTest.addError(PatternRuleTest.java:322)
        at org.languagetool.rules.patterns.PatternRuleTest.assertSuggestions(PatternRuleTest.java:582)
        at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:474)
        at org.languagetool.rules.patterns.PatternRuleTest.lambda$testGrammarRulesFromXML$1(PatternRuleTest.java:357)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

Rule code:

    <!-- ME DISSE Disse-me / ME DISSERAM Disseram-me / TE PODERIA Poder-te-ia / NOS PODERIAM Poder-nos-iam -->
    <rulegroup id='PROCLISE_COMECO_FRASE' name="Proclise começo frase">
    <!--      Created by Ricardo Joseh Lima and improved by Marco A.G.Pinto, Portuguese rule 2021-11-25 (25-JUN-2021+)      -->
    <!--
Me disse que ele veio. → Disse-me que ele veio.
Me disseram que ele veio. → Disseram-me que ele veio.
Te poderia escrever uma carta? → Poder-te-ia escrever uma carta?
Nos poderiam escrever uma carta? → Poder-nos-iam escrever uma carta?
    -->
        <rule>
            <pattern>
                <token postag='SENT_START' postag_regexp='no'/>
                <marker>
                    <token regexp='yes' spacebefore='no'>se|lhes?|me|te|nos|vos</token>
                    <token postag='VMI[MS][13].+' postag_regexp='yes'/>
                </marker>
            </pattern>
            <message>No registro formal escrito, não se começa frase com pronome átono. Coloque o pronome após o verbo.</message>
            <suggestion>\3-\2</suggestion>
            <example correction="Disse-me"><marker>Me disse</marker> que ele veio.</example>
            <example correction="Disseram-me"><marker>Me disseram</marker> que ele veio.</example>
        </rule> 
        <rule>
            <pattern>
                <token postag='SENT_START' postag_regexp='no'/>
                <marker>
                    <token regexp='yes' spacebefore='no'>se|lhes?|me|te|nos|vos</token>
                    <token postag='VMI[CF].S.+' postag_regexp='yes'/>
                </marker>
            </pattern>
            <message>No registro formal escrito, não se começa frase com pronome átono. Coloque o pronome após o verbo.</message>
            <suggestion><match no='3' postag="V.+" postag_replace='VMN0000' postag_regexp="yes"/>-\2-ia</suggestion>
            <example correction="Poder-te-ia"><marker>Te poderia</marker> escrever uma carta?</example>
        </rule>
        <rule>
            <pattern>
                <token postag='SENT_START' postag_regexp='no'/>
                <marker>
                    <token regexp='yes' spacebefore='no'>se|lhes?|me|te|nos|vos</token>
                    <token postag='VMI[CF].P.+' postag_regexp='yes'/>
                </marker>
            </pattern>
            <message>No registro formal escrito, não se começa frase com pronome átono. Coloque o pronome após o verbo.</message>
            <suggestion><match no='3' postag="V.+" postag_replace='VMN0000' postag_regexp="yes"/>-\2-iam</suggestion>
            <example correction="Poder-nos-iam"><marker>Nos poderiam</marker> escrever uma carta?</example>
        </rule>
    </rulegroup>
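
The failures above all have the same shape: the pronoun keeps the capital letter it had at the start of the sentence. A minimal Python sketch of the intended casing follows (illustrative only; `join_suggestion` is a made-up name, and this is not how LanguageTool builds its suggestions):

```python
def join_suggestion(parts):
    """Join verb/pronoun/ending with hyphens, capitalising only the first part."""
    lowered = [p.lower() for p in parts]      # the moved pronoun loses its capital
    lowered[0] = lowered[0].capitalize()      # the verb now opens the sentence
    return "-".join(lowered)

print(join_suggestion(["disse", "Me"]))          # Disse-me
print(join_suggestion(["poder", "Nos", "iam"]))  # Poder-nos-iam
```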

marcoagpinto commented 2 years ago

@ricardojosehlima

I found out how to do it: https://github.com/languagetool-org/languagetool/commit/4b9b5f58059d6dca32de5365bb99ce9497dfb13f

Here are the results against 900 000 sentences: Proclise_RicardoJosehLima_20211125.txt

SentenceSourceChecker: org.languagetool.dev.dumpcheck.DocumentLimitReachedException: Maximum number of documents (900000) reached
Portuguese (Portugal): 110 total matches
Portuguese (Portugal): ø0.00 rule matches per sentence
Portuguese (Portugal): 0 input lines ignored (e.g. not between 10 and 300 chars or at least 4 tokens)

There is an issue with forms like "encontrar-nos-emos": the rule suggests "encontrar-nos-ias" or something like that.

I need to rewrite the rule from the other day and replace the code in this rule to fix this issue.
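
One way to avoid a fixed ending, sketched in Python (an illustration, not the LanguageTool implementation): take the ending from the verb form itself by stripping the infinitive. The name `mesoclitic_form` is made up for this sketch, and it only covers regular verbs whose future or conditional is the infinitive plus an ending.

```python
def mesoclitic_form(pronoun: str, verb_form: str, infinitive: str) -> str:
    """Build infinitive-pronoun-ending, deriving the ending from the verb form."""
    verb_form = verb_form.lower()
    # Regular future/conditional forms are the infinitive followed by an ending.
    if not verb_form.startswith(infinitive) or len(verb_form) == len(infinitive):
        raise ValueError(f"{verb_form!r} is not {infinitive!r} plus an ending")
    ending = verb_form[len(infinitive):]
    return f"{infinitive}-{pronoun.lower()}-{ending}"

print(mesoclitic_form("Nos", "encontraremos", "encontrar"))  # encontrar-nos-emos
print(mesoclitic_form("Te", "poderia", "poder"))             # poder-te-ia
```

Forms whose stem is not the full infinitive are rejected with a `ValueError` instead of producing a wrong suggestion.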

Maybe I will do it around the end of next week, since it is almost British dictionary update day, and on Monday and Tuesday I will dedicate most of my time to the dictionary.

Thanks!

ricardojosehlima commented 2 years ago

Great job!!!

Just a few comments:

1-) Along with the problem you found with 'encontrar-nos-emos', I found some others:

Te amarei → Amarei-te
Me fará → Farar-me-ia; Fazer-me-ia

2-) You included 'Se' in the rule. However, in the sentences below 'Se' is a conjunction; they are all correct and shouldn't be captured by the rule:

Se foi sancionado pelo Tribunal de Justiça da União Europeia a 13 de Março de 1968 em matéria de política ...
Se separou uma carta fechada, receberá outra fechada ou se separou uma carta aberta, receberá outra aberta.
Se estiveram com o vosso pai, tudo bem.

However, the two sentences below belong to the group where 'Se' is incorrectly placed and the rule should apply:

Se iniciou no final dos anos 1970 e se diluiu em diversos estilos nos anos 1980.
Se opôs, principalmente, aos excessos do rock progressivo, do fusion e do hard rock quando, em 1977, invadi...

IMHO, there are more cases of the second group than of the first. However, since including 'Se' in the rule may generate false alarms, it is up to you at LanguageTool to weigh the options.

3-) Finally, almost all the sentences in the file come from Tatoeba and are informal writing. It would be awkward to correct "Me pegaram!" from the mouth of a Brazilian child and replace it with "Pegaram-me". So perhaps this rule should either be applied only to formal registers, or its message should carry a warning such as: "If this is a formal register, seriously consider replacing it with 'Pegaram-me'".

marcoagpinto commented 2 years ago

Ahhhh... I will have a look at it next week.

marcoagpinto commented 2 years ago

@ricardojosehlima

I have released the dictionaries today, so I decided I would have some time for LanguageTool.

I have been improving the rule, from which I will take the code for the one suggested by you.

I have made two commits for it.

Could you check if this is valid?:

Assim poderia-se escrever à Ana. → Assim poder-se-ia escrever à Ana.
Assim poderias-te declarar à Ana. → Assim poder-te-ias declarar à Ana.
Assim poderiam-se escrever livros sobre o assunto. → Assim poder-se-iam escrever livros sobre o assunto.
Assim encontrarei-te amanhã. → Assim encontrar-te-ei amanhã.
Assim encontraremos-nos amanhã. → Assim encontrar-nos-emos amanhã.
Assim fará-me o trabalho. → Assim fazer-me-á o trabalho.
Assim farás-me o trabalho. → Assim fazer-me-ás o trabalho.
Assim farão-me o trabalho. → Assim fazer-me-ão o trabalho.

I could swear that "Assim farás-me o trabalho." could be written as "Assim far-me-ás o trabalho.", but no dictionary has "far", so I used the infinitive.

Also, here is the test against 900 000 sentences. Unfortunately, almost 100% of the hits are verbs ending in "-se", so I could only check that the rule works with the examples above in the standalone tool.

poderiam-se_poder-se-iam_20211130.txt

If all is well, tomorrow at 5am I will recode your rule and, as you suggested, stop matching "se" at the start of a sentence to remove false positives.

Thanks!

ricardojosehlima commented 2 years ago

Hi, and yes, your intuition about 'far-me-ás' is correct; 'fazer-me-ás' is not standard Portuguese (for me). As for 'se', I was referring to the pronoun at the start of the sentence, not the one after the verb. So 'poderia-se' can still be corrected to 'poder-se-ia'; it is 'Se poderia' that should be excluded.

marcoagpinto commented 2 years ago

@ricardojosehlima

So, how do I fix the "far-me-ás", "far-me-ias", etc.?

Should I simply ignore the verb "fazer"?

There is no POS information for "far", nor does it appear in dictionaries.

Thanks!

ricardojosehlima commented 2 years ago

@marcoagpinto Maybe add it to some dictionary of languagetool?

marcoagpinto commented 2 years ago

@ricardojosehlima

It won't work, since the rule/POS won't be able to distinguish between "far" for this verb and for the others.

The best option is to add an exception for the verb "fazer".

Are there any more verbs that should be added to the exception?

Thanks!

marcoagpinto commented 2 years ago

@ricardojosehlima

I have big plans for LanguageTool in 2022!

With your help, we will be able to solve most of the verb issues.

🙂

Can you send me your e-mail in private for the Christmas e-mail I send to my friends every year?

Thanks!

ricardojosehlima commented 2 years ago

Hi @marcoagpinto, yes, there are other verbs in the same situation as fazer: dizer and trazer are two that I remember. As for my email, thanks, it is an honor to be included in your Christmas list, but how and where can I send it to you privately?
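
For reference, a small Python sketch (illustrative only, not LanguageTool code) of how these three verbs differ: their future and conditional forms use a contracted stem (far-, dir-, trar-) instead of the full infinitive. The lookup table and the function name `mesoclisis` are assumptions for the example.

```python
# Contracted future/conditional stems for the verbs named in this thread.
IRREGULAR_STEMS = {"fazer": "far", "dizer": "dir", "trazer": "trar"}

def mesoclisis(infinitive: str, pronoun: str, ending: str) -> str:
    """Build stem-pronoun-ending, using the contracted stem where one is listed."""
    stem = IRREGULAR_STEMS.get(infinitive, infinitive)  # regular verbs keep the infinitive
    return f"{stem}-{pronoun}-{ending}"

print(mesoclisis("fazer", "me", "ás"))         # far-me-ás
print(mesoclisis("encontrar", "nos", "emos"))  # encontrar-nos-emos
```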

marcoagpinto commented 2 years ago

Send it from the forum 🙂

ricardojosehlima commented 2 years ago

Ok!

marcoagpinto commented 2 years ago

@ricardojosehlima

It is done, and I believe it is solid as a rock: https://github.com/languagetool-org/languagetool/commit/e68586dd34cd8868ada621ead305e00b54522a1e

Here are the results against 900 000 sentences: Proclise_RicardoJosehLima_20211202.txt

I haven't received any private message from you so far. 🙂

ricardojosehlima commented 2 years ago

Hi @marcoagpinto indeed it is! Sorry for the delay in sending you the private message, yesterday was one of those extra busy days, but I have already sent it via the forum.