Open udomai opened 3 years ago
There are some rules that detect the noun/verb issues.
But they won't work in all cases, and words have to be added manually.
I will take a look at it this afternoon.
Well, it seems to be fixed.
Can you test it with the sentence used by the user?
Note that the rules that add an accent to words have a bug: they put an accent on every occurrence of the letter in question. For "paragrafo" they suggest "párágráfo" or similar.
I don't know how to fix it.
https://github.com/languagetool-org/languagetool/commit/a32fc8a4d31b6424ae965dd83d6898bdda38f731
Thank you, @marcoagpinto!
The ACCENTUATED_PARONYMS rule has to be prioritized over CACOFONIA. I can do that, if you like.
To avoid the regexp_replace replacing each "a" with an "á", maybe you can do something like this:
...
<token regexp='yes'>(...|par(a)grafo|p(a)tio|perdul(a)rio|...)s?</token>
</pattern>
<message>Esta palavra é um verbo. Se pretende referir-se a um nome ou adjetivo, deve utilizar a forma acentuada.</message>
<suggestion><match no='1' include_skipped='all'/> <match no='2' include_skipped='all'/> <match no='3' regexp_match='$1' regexp_replace='á'/></suggestion>
...
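Outside LanguageTool, the intent of the parenthesised pattern can be illustrated in plain Python. This is only a sketch of the idea (replace just the captured letter instead of every "a"); it assumes nothing about how LanguageTool's match element treats capture groups, and the word list is limited to the three examples from the pattern above. The function names are hypothetical.

```python
import re

# Word list copied from the three examples in the pattern above;
# exactly one capture group marks the letter that should receive the accent.
PARONYMS = re.compile(r"(?:par(a)grafo|p(a)tio|perdul(a)rio)s?", re.IGNORECASE)


def accent_every_a(word: str) -> str:
    """Reproduces the reported bug: every 'a' is replaced."""
    return re.sub("a", "á", word)


def accent_marked_a(word: str) -> str:
    """Replaces only the captured 'a'; unknown words are returned unchanged."""
    match = PARONYMS.fullmatch(word)
    if match is None:
        return word
    # Only one alternative can match, so lastindex is the group that took part.
    start, end = match.span(match.lastindex)
    accented = "Á" if word[start].isupper() else "á"
    return word[:start] + accented + word[end:]


print(accent_every_a("paragrafo"))   # párágráfo
print(accent_marked_a("paragrafo"))  # parágrafo
print(accent_marked_a("Patios"))     # Pátios
```

The trade-off is that every word needs its own alternative with the stressed vowel marked, so the pattern grows with the word list.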
@udomai
Sure, please prioritise it.
How will you do it?
I still don't understand how rules get priority over other rules.
I will work on the regexp fix later on.
Thanks!
Okay, I tried it out – my solution doesn't work... maybe @jaumeortola has an idea!
Also, I saw that the rule doesn't need prioritizing; it is already preferred over the style rule CACOFONIA. It's just that a sentence starting with "Paragrafo da..." doesn't match the pattern of ACCENTUATED_PARONYMS (first token not present). Maybe the rule should also include SENT_START as the first token?
> Okay, I tried it out – my solution doesn't work... maybe @jaumeortola has an idea!
What didn't work? The prioritising or the regexp with parentheses?
My parentheses idea didn't work the way I sketched it above.
Regarding the general class of errors like "paragrafo" vs "parágrafo": Spanish and Catalan have somewhat more complex and precise rules that could be adapted to Portuguese. The first step is to extract a list of all possible confusions. Instead of (unreadable) entities in XML, I would put them in a simple text file. Just to be sure, @marcoagpinto: the most usual combination is verb without diacritic vs noun/adj with diacritic, isn't it? Are there other possible confusions? The other way around (verbs with diacritic)?
The rule PARONYM_ARVORE_0 can be generalized for parágrafo only with a very complex regular expression. So let's just put it in a different rule. https://github.com/languagetool-org/languagetool/commit/c29f51e56bdd8ba9f94cde3e625396e3c901511d
> Just to be sure, @marcoagpinto: the most usual combination is verb without diacritic vs noun/adj with diacritic, isn't it? Are there other possible confusions? The other way around (verbs with diacritic)?
To be honest, right now I don't know the answer.
These are the confusions I found (verb without diacritic vs adjective/noun with diacritic). Does it make sense to you, @marcoagpinto? verb-nomadj-pt.txt
@jaumeortola
I have taken a quick look at the file.
Many verbs there are unknown to me (maybe of rare usage), but I recognise some.
Is there a plan for the file?
> Is there a plan for the file?
The plan is to do the same as has been done for Catalan and Spanish: put the list in a file, which can then be used with a filter in different rules. Some rules will be completely safe (for example: preposition + *verb->noun/adjective). But other rules will need testing and exceptions.
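As a rough illustration of the list-plus-filter idea, here is a minimal Python sketch. It is not the Catalan/Spanish implementation: the one-pair-per-line `verb;accented` file format, the sample entries, the short preposition set, and both function names are assumptions made up for this example, and the real verb-nomadj-pt.txt may look different.

```python
from typing import Dict, List, Tuple

# Assumed file format (hypothetical): "verb form;noun or adjective" per line.
SAMPLE_LIST = """\
# verb form (no diacritic);noun/adjective (with diacritic)
paragrafo;parágrafo
publico;público
pratica;prática
"""

# Small illustrative set of prepositions and contractions, not a complete list.
PREPOSITIONS = {"de", "do", "da", "no", "na", "em", "com", "por", "sem", "sobre"}


def load_confusions(text: str) -> Dict[str, str]:
    """Parses the list, skipping blank lines and '#' comments."""
    pairs: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        verb, accented = line.split(";", 1)
        pairs[verb.strip()] = accented.strip()
    return pairs


def suggest_after_preposition(
    tokens: List[str], confusions: Dict[str, str]
) -> List[Tuple[int, str]]:
    """Returns (token index, accented form) for listed verbs that follow a preposition."""
    suggestions = []
    for i in range(1, len(tokens)):
        previous, current = tokens[i - 1].lower(), tokens[i].lower()
        if previous in PREPOSITIONS and current in confusions:
            suggestions.append((i, confusions[current]))
    return suggestions


confusions = load_confusions(SAMPLE_LIST)
print(suggest_after_preposition("O texto do paragrafo".split(), confusions))
# [(3, 'parágrafo')]
```

Because the check looks at the previous token, it never fires on the first word of a sentence, and a real rule would still need the testing and exceptions mentioned above.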
@marcoagpinto I'd like to know your opinion about the results of DIACRITICS[1]. I have just fixed some false alarms, but probably you can spot more.
Hello @jaumeortola
There were tons of false positives in the nightly diff.
But the concept is the right approach.
@jaumeortola
What shall I do to fix them?
> What shall I do to fix them?
You can see how I fixed some in my last two commits (adding multiwords, improving tags and disambiguation...). If you are not able to fix them, just list them here.
Ahhhh... my task for 5am.
At 5am I will download the latest nightly and see what I can do (with the remaining ones).
There are potential problems with nos + noun/verb because it is ambiguous. Take a look at the nouns (masc. pl.) in the list, and see whether they are more likely to be nouns or verbs after nos. We can refine the rule for this case if you find potential problems.
I am working on it right now, for around 2-3 hours already.
I am testing against a 600 000 corpus, so it will take some more time before I commit it.
We got feedback from a user: she found it adorable that we warn about the "foda" in "paragrafo da", but noted that we do not propose the accent that turns the verb form into a noun.
@jaumeortola and @marcoagpinto, I think this is a difficult rule to implement. Haven't you been talking about this lately? Can we do something about it?