languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] Idea for rule: “mais conhecidos/importantes” → “-chave” - 2023-03-18 #8074

Open marcoagpinto opened 1 year ago

marcoagpinto commented 1 year ago

Hello @ricardojosehlima

Look at this sentence: "A entropia é um dos conceitos mais conhecidos na teoria." It would change to: "A entropia é um dos conceitos-chave na teoria."

Should we give it a try?

A scientific/academic rule?

Thanks!

ricardojosehlima commented 1 year ago

Yes

marcoagpinto commented 1 year ago

I am still working on it. The first attempt produced tons of false positives:

Portuguese (Portugal): 237 total matches
Portuguese (Portugal): 582235 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence

0.txt

I was creating antipatterns, but concluded that it would be easier to make the rule more restrictive.

It will have fewer hits, but it will be more accurate and simpler.

marcoagpinto commented 1 year ago

This involves adding POS tags one by one and comparing the results.

marcoagpinto commented 1 year ago

@ricardojosehlima

Hello!

I have just finished it: https://github.com/languagetool-org/languagetool/commit/c1af5c3f9a23aaa425d0c72baee553a2fc0a1648

Portuguese (Portugal): 146 total matches
Portuguese (Portugal): 582235 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence

23.txt

    <!-- CONCEITOS MAIS IMPORTANTES conceitos-chave -->
    <rule id='CIENTÍFICO_MAIS_IMPORTANTE_CHAVE' name="[Científico] mais conhecidos/importantes → -chave" type="style">
      <!-- Created by Marco A.G.Pinto with Ricardo Joseh Lima suggestions, Portuguese rule 2023-03-20 (2-MAR-2023+) -->
      <!--
      A entropia é um dos conceitos mais conhecidos na teoria. → A entropia é um dos conceitos-chave na teoria.
      -->

      <antipattern>
        <token postag='NC.+|AQ.+|NP.+' postag_regexp='yes'/>
        <token>mais</token>
        <token regexp='yes'>conhecidos?|importantes?</token>
        <token postag='CC' postag_regexp='no'/>
        <token min="0" max="1">mais</token>
        <token postag='NC.+|AQ.+|NP.+' postag_regexp='yes'/>
        <example>Tóquio é a cidade mais importante e mais moderna do Japão.</example>
        <example>Tóquio é a cidade mais importante e moderna do Japão.</example>
        <example>O Real Madrid é um dos times mais importantes e vitoriosos do mundo!</example>
      </antipattern>

      <pattern>
        <token postag='(SPS00:)?DA.+|DI.+|SPS00|Z0.+|DP.+|CC|NC.+|AQ.+|NP.+' postag_regexp='yes'>
          <exception scope='previous' postag_regexp='yes' postag='AQ.+'/>
        </token>
        <marker>
          <token postag='NC.+|AQ.+' postag_regexp='yes'>
            <exception regexp='yes' inflected='yes'>nadar|ser</exception> <!-- Verbs exceptions -->
            <exception regexp='yes' inflected='yes'>algo|centro|chave|coisa|forma|risco|&languages;</exception> <!-- Nouns/Adjectives exceptions -->
            <exception postag_regexp='yes' postag='Z0.+'/>
          </token>
          <token>mais</token>
          <token regexp='yes'>conhecidos?|importantes?</token>
        </marker>
        <token postag='V.+|SPS00|SPS00:DA.+|SPS00:DD.+|_PUNCT|CC|RG' postag_regexp='yes'/>
      </pattern>
      <message>Num contexto formal/científico, é preferível escrever '-chave'.</message>
      <suggestion>\2-chave</suggestion>
      <example correction="conceitos-chave">A entropia é um dos <marker>conceitos mais conhecidos</marker> na teoria.</example>
    </rule>

Thanks!

😄 ❤️ 🤗

susanaboatto commented 1 year ago

Hi @marcoagpinto

There are many false alarms for this on today's diff. Please remember to temp_off rules when you first create them, so we can fix these errors before they are live to users :)

https://regression.languagetoolplus.com/via-http/2023-03-22/pt-BR/result_grammar_CIENT%C3%8DFICO_MAIS_IMPORTANTE_CHAVE%5B1%5D.html
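
For readers unfamiliar with the workflow mentioned above, here is a minimal sketch of what marking a new rule as temp_off could look like. It reuses the rule header from this thread and adds `default="temp_off"`; the pattern is deliberately simplified for illustration and is not the committed one.

```xml
<!-- Sketch only: the rule header from this thread with default="temp_off" added. -->
<!-- The pattern below is simplified for illustration; it is not the committed pattern. -->
<rule id='CIENTÍFICO_MAIS_IMPORTANTE_CHAVE' name="[Científico] mais conhecidos/importantes → -chave" type="style" default="temp_off">
  <pattern>
    <token postag='NC.+' postag_regexp='yes'/>
    <token>mais</token>
    <token regexp='yes'>conhecidos?|importantes?</token>
  </pattern>
  <message>Num contexto formal/científico, é preferível escrever '-chave'.</message>
  <!-- \1 refers to the first pattern token, i.e. the noun -->
  <suggestion>\1-chave</suggestion>
  <example correction="conceitos-chave">A entropia é um dos <marker>conceitos mais conhecidos</marker> na teoria.</example>
</rule>
```

With this attribute the rule is not shown to users, but its matches can still be reviewed in the nightly diff before it is switched on.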

marcoagpinto commented 1 year ago

@susanaboatto

HELP!!!!

I tested it against 600 000 sentences, see above, and all hits seemed valid.

Maybe my testing files are outdated? pt-PT.txt tatoeba-pt.txt

Could you send me the most recent files?

Thanks!

marcoagpinto commented 1 year ago

@danielnaber @jaumeortola @maphjo

Hello!

When I create a rule and test it against 600 000 sentences, the results should match those of the nightly diff.

That no longer seems to be the case, since the nightly diff shows different results.

Could someone send me the most recent testing files, or the ones used by the nightly diff?

The idea is that I would not need to temp_off rules, since I would already know the results.

Thanks!

danielnaber commented 1 year ago

> The idea is that I would not need to temp_off rules, since I would already know the results.

I will send you the file for the nightly. Please stick to the workflow with temp_off anyway. It's the most reliable way to avoid false alarms for users.

marcoagpinto commented 1 year ago

@danielnaber

Thank you for the file, but it is the same as the one I had, with just a few dozen extra sentences.

Would you kindly e-mail me the pt-BR one?

Sorry for bothering you.

Thanks!

pt-PT_database_20230322

ricardojosehlima commented 1 year ago

Hi @marcoagpinto, at least for me in pt-BR, words like detalhes-chave and fatores-chave look strange. Conceito-chave is fine, but the others don't seem valid :(

marcoagpinto commented 1 year ago

@ricardojosehlima

Sure, they seem valid in pt-PT.

I will move it to pt-PT tonight.

Thanks!

marcoagpinto commented 1 year ago

@danielnaber

Thank you for the pt-BR file; it seems to have around twice as many sentences as the pt-PT one.

Now I can do some real testing.

❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️

marcoagpinto commented 1 year ago

@ricardojosehlima @susanaboatto

I got frustrated after spending 12+ hours trying to fix the false positives, so I rewrote the rule to match only the valid words I could find.

https://github.com/languagetool-org/languagetool/commit/68630849794a3503b57d77b2dcbf9e6ffec41a7d

Portuguese (Portugal): 101 total matches
Portuguese (Portugal): 599999 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence

2new.txt
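
As a rough illustration of the word-list approach described above, a sketch could look like the following. The rule id is hypothetical and the nouns are only the ones mentioned in this thread, so this is not the list from the commit.

```xml
<!-- Sketch only: a word-list variant of the rule. -->
<!-- The id is hypothetical and the noun list is illustrative, not the committed list. -->
<rule id='EXAMPLE_MAIS_IMPORTANTE_CHAVE_LIST' name="[Científico] mais conhecidos/importantes → -chave" type="style" default="temp_off">
  <pattern>
    <!-- Only explicitly listed nouns can trigger the rule -->
    <token regexp='yes'>conceitos?|detalhes?|fator(es)?</token>
    <token>mais</token>
    <token regexp='yes'>conhecidos?|importantes?</token>
  </pattern>
  <message>Num contexto formal/científico, é preferível escrever '-chave'.</message>
  <!-- \1 refers to the first pattern token, i.e. the listed noun -->
  <suggestion>\1-chave</suggestion>
  <example correction="conceitos-chave">A entropia é um dos <marker>conceitos mais conhecidos</marker> na teoria.</example>
</rule>
```

The trade-off is the one already noted in this thread: an explicit list matches fewer sentences than a POS-based pattern, but every hit involves a noun that has been checked by hand.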

Should we leave it in pt-PT or move it back to the original PT?

Thanks!

ricardojosehlima commented 1 year ago

I vote for pt-PT only

marcoagpinto commented 1 year ago

Sure, then it is done!

😋 😋 😋 😋 😋 😋 😋 😋 😋 😋 😋