Open marcoagpinto opened 1 year ago
Yes
I am still working on it; the first attempt produced tons of false positives:
Portuguese (Portugal): 237 total matches
Portuguese (Portugal): 582235 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence
I was creating antipatterns, but reached the conclusion that it would be easier to make the rule more restrictive.
It will have fewer hits, but it will be more accurate and simpler.
This involves adding POSes one by one and comparing the results.
@ricardojosehlima
Hello!
I have just finished it: https://github.com/languagetool-org/languagetool/commit/c1af5c3f9a23aaa425d0c72baee553a2fc0a1648
Portuguese (Portugal): 146 total matches
Portuguese (Portugal): 582235 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence
<!-- CONCEITOS MAIS IMPORTANTES conceitos-chave -->
<rule id='CIENTÍFICO_MAIS_IMPORTANTE_CHAVE' name="[Científico] mais conhecidos/importantes → -chave" type="style">
    <!-- Created by Marco A.G.Pinto with Ricardo Joseh Lima suggestions, Portuguese rule 2023-03-20 (2-MAR-2023+) -->
    <!--
    A entropia é um dos conceitos mais conhecidos na teoria. → A entropia é um dos conceitos-chave na teoria.
    -->
    <antipattern>
        <token postag='NC.+|AQ.+|NP.+' postag_regexp='yes'/>
        <token>mais</token>
        <token regexp='yes'>conhecidos?|importantes?</token>
        <token postag='CC' postag_regexp='no'/>
        <token min="0" max="1">mais</token>
        <token postag='NC.+|AQ.+|NP.+' postag_regexp='yes'/>
        <example>Tóquio é a cidade mais importante e mais moderna do Japão.</example>
        <example>Tóquio é a cidade mais importante e moderna do Japão.</example>
        <example>O Real Madrid é um dos times mais importantes e vitoriosos do mundo!</example>
    </antipattern>
    <pattern>
        <token postag='(SPS00:)?DA.+|DI.+|SPS00|Z0.+|DP.+|CC|NC.+|AQ.+|NP.+' postag_regexp='yes'>
            <exception scope='previous' postag_regexp='yes' postag='AQ.+'/>
        </token>
        <marker>
            <token postag='NC.+|AQ.+' postag_regexp='yes'>
                <exception regexp='yes' inflected='yes'>nadar|ser</exception> <!-- Verbs exceptions -->
                <exception regexp='yes' inflected='yes'>algo|centro|chave|coisa|forma|risco|&languages;</exception> <!-- Nouns/Adjectives exceptions -->
                <exception postag_regexp='yes' postag='Z0.+'/>
            </token>
            <token>mais</token>
            <token regexp='yes'>conhecidos?|importantes?</token>
        </marker>
        <token postag='V.+|SPS00|SPS00:DA.+|SPS00:DD.+|_PUNCT|CC|RG' postag_regexp='yes'/>
    </pattern>
    <message>Num contexto formal/científico, é preferível escrever '-chave'.</message>
    <suggestion>\2-chave</suggestion>
    <example correction="conceitos-chave">A entropia é um dos <marker>conceitos mais conhecidos</marker> na teoria.</example>
</rule>
Thanks!
😄 ❤️ 🤗
Hi @marcoagpinto
There are many false alarms for this on today's diff. Please remember to temp_off
rules when you first create them, so we can fix these errors before they are live to users :)
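For reference, a minimal sketch of what switching a new rule to temp_off might look like. It assumes the `default` attribute on `<rule>` accepts the value `temp_off`, as in other LanguageTool grammar files, and it reuses the rule header posted earlier in this thread with the body elided:

```xml
<!-- Sketch only: same rule header as posted above, with default="temp_off" added (assumed attribute value). -->
<rule id='CIENTÍFICO_MAIS_IMPORTANTE_CHAVE' name="[Científico] mais conhecidos/importantes → -chave" type="style" default="temp_off">
    <!-- antipattern, pattern, message, suggestion and examples stay unchanged -->
</rule>
```

Once the nightly diff looks clean, the attribute can be removed again so the rule goes live.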
@susanaboatto
HELP!!!!
I tested it against 600 000 sentences, see above, and all hits seemed valid.
Maybe my testing files are outdated? pt-PT.txt tatoeba-pt.txt
Could you send me the most recent files?
Thanks!
@danielnaber @jaumeortola @maphjo
Hello!
When I create a rule and test it against 600 000 sentences, I should get the same results as the nightly diff.
But that no longer seems to be the case, since the nightly diff shows different results.
Could someone send me the most recent testing files, or the ones used by the nightly diff?
The idea is for me not having to temp_off rules since I already know the results.
Thanks!
The idea is for me not having to temp_off rules since I already know the results.
I will send you the file for the nightly. Please stick to the workflow with temp_off
anyway. It's the most reliable way to avoid false alarms for users.
@danielnaber
Thank you for the file, but it is the same as the one I had, with only a few dozen extra sentences.
Would you kindly e-mail me the pt-BR one?
Sorry for bothering you.
Thanks!
Hi @marcoagpinto , at least for me in pt-br, words like detalhes-chave, fatores-chave look strange. When it is conceito-chave it's fine but all the others don't seem to be valid :(
@ricardojosehlima
Sure, they seem valid in pt-PT.
I will move it to pt-PT tonight.
Thanks!
@danielnaber
Thank you for the pt-BR file; it seems to have about twice as many sentences as the pt-PT one.
Now I can do some real testing.
❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️
@ricardojosehlima @susanaboatto
I got mad after spending 12+ hours trying to fix the false positives, and rewrote the code by just adding the valid words I could find.
https://github.com/languagetool-org/languagetool/commit/68630849794a3503b57d77b2dcbf9e6ffec41a7d
Portuguese (Portugal): 101 total matches
Portuguese (Portugal): 599999 total sentences considered
Portuguese (Portugal): ø0.00 rule matches per sentence
Should we leave it on pt-PT or move back to the original PT?
Thanks!
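To illustrate the word-list approach described above, here is a rough sketch of how the marked part of the pattern could look when the POS-based noun token is replaced by an explicit list. The word list is an assumption made up from nouns mentioned in this thread; it is not the content of the linked commit:

```xml
<!-- Illustrative only: hypothetical word list, not taken from the commit. -->
<marker>
    <!-- Explicit nouns instead of postag='NC.+|AQ.+' -->
    <token regexp='yes'>conceitos?|detalhes?|fator(es)?</token>
    <token>mais</token>
    <token regexp='yes'>conhecidos?|importantes?</token>
</marker>
```

The trade-off is the one already noted: fewer hits, but every listed word can be checked by hand.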
I vote for pt-PT only
Sure, then it is done!
😋 😋 😋 😋 😋 😋 😋 😋 😋 😋 😋
Hello @ricardojosehlima
Look at the sentence:
A entropia é um dos conceitos mais conhecidos na teoria.
would change to: A entropia é um dos conceitos-chave na teoria.
Should we give it a try?
A scientific/academic rule?
Thanks!