languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] not found: paragrafo vs parágrafo #4103

Open udomai opened 3 years ago

udomai commented 3 years ago

We got feedback from a user: she found it adorable that we warn about the "foda" in "paragrafo da", but noted that we do not suggest the accent that turns the verb form into a noun.

@jaumeortola and @marcoagpinto, I think this is a difficult rule to implement. Haven't you been talking about this lately? Can we do something about it?

marcoagpinto commented 3 years ago

There are some rules that detect noun/verb confusions.

But they won't work in all cases, and words have to be added manually.

I will take a look at it this afternoon.

marcoagpinto commented 3 years ago

Well, it seems to be fixed.

Can you test it with the sentence used by the user?

Note that the rules that add an accent to words have a bug: they accent every occurrence of the vowel in question, not just the stressed one. For "paragrafo" the suggestion comes out as "párágráfo" or similar.

I don't know how to fix it.

https://github.com/languagetool-org/languagetool/commit/a32fc8a4d31b6424ae965dd83d6898bdda38f731
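
The bug described above can be reproduced outside LanguageTool. This is a minimal Python sketch, not the rule engine itself: a blanket replacement accents every "a", whereas a lookup table touches only the stressed vowel. The table `ACCENTED` and both function names are hypothetical and exist only for illustration.

```python
import re

# Hypothetical mapping from the unaccented verb form to the accented
# noun/adjective; only the stressed vowel carries the accent.
ACCENTED = {
    "paragrafo": "parágrafo",
    "patio": "pátio",
    "perdulario": "perdulário",
}


def naive_accent(word: str) -> str:
    """Reproduce the bug: every 'a' in the word gets an accent."""
    return re.sub("a", "á", word)


def lookup_accent(word: str) -> str:
    """Return the accented form, keeping a plural 's' and an initial capital."""
    lower = word.lower()
    plural = lower.endswith("s") and lower[:-1] in ACCENTED
    base = lower[:-1] if plural else lower
    accented = ACCENTED.get(base)
    if accented is None:
        return word  # not a known paronym: leave the word untouched
    if plural:
        accented += "s"
    if word[0].isupper():
        accented = accented[0].upper() + accented[1:]
    return accented


print(naive_accent("paragrafo"))
print(lookup_accent("paragrafo"))
```

A lookup avoids the question of which vowel to accent, at the cost of maintaining the list by hand.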

udomai commented 3 years ago

Thank you, @marcoagpinto!

The ACCENTUATED_PARONYMS rule has to be prioritized over CACOFONIA. I can do that, if you like.

To avoid the regexp_replace replacing each "a" with an "á", maybe you can do something like this:

...
        <token regexp='yes'>(...|par(a)grafo|p(a)tio|perdul(a)rio|...)s?</token>
      </pattern>
      <message>Esta palavra é um verbo. Se pretende referir-se a um nome ou adjetivo, deve utilizar a forma acentuada.</message>
      <suggestion><match no='1' include_skipped='all'/> <match no='2' include_skipped='all'/> <match no='3' regexp_match='$1' regexp_replace='á'/></suggestion>
...

marcoagpinto commented 3 years ago

@udomai

Sure, please prioritise it.

How will you do it?

I still don't understand how rules get priority over other rules.

I will work on the regexp fix later on.

Thanks!

udomai commented 3 years ago

Okay, I tried it out – my solution doesn't work... maybe @jaumeortola has an idea!

Also, I saw that the rule doesn't need prioritizing: it is already preferred over the style rule CACOFONIA. The real problem is that a sentence starting with "Paragrafo da..." doesn't match the pattern of ACCENTUATED_PARONYMS, because there is no token before the word to fill the pattern's first slot. Maybe the rule should also accept SENT_START as the first token?
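
To illustrate the sentence-start problem, here is a toy Python model. It is a sketch under simplifying assumptions and not how LanguageTool evaluates patterns: the pattern is reduced to "some token, then a paronym", and the string `SENT_START` stands in for the sentence-start token mentioned above. The function names and the one-word paronym set are hypothetical.

```python
SENT_START = "SENT_START"  # sentinel standing in for the sentence-start token
PARONYMS = {"paragrafo"}   # hypothetical one-word list for the demo


def tokenize(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, and prepend the sentence-start sentinel."""
    return [SENT_START] + sentence.lower().split()


def matches(tokens: list[str], allow_sent_start: bool) -> bool:
    """True if a paronym follows a token that the toy pattern accepts."""
    for prev, cur in zip(tokens, tokens[1:]):
        if cur in PARONYMS and (allow_sent_start or prev != SENT_START):
            return True
    return False


sentence = tokenize("Paragrafo da lei")
print(matches(sentence, allow_sent_start=False))  # sentence-initial word is missed
print(matches(sentence, allow_sent_start=True))   # accepted once SENT_START counts
```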

marcoagpinto commented 3 years ago

> Okay, I tried it out – my solution doesn't work... maybe @jaumeortola has an idea!

What didn't work? The prioritising or the regexp with parentheses?

udomai commented 3 years ago

My parentheses idea didn't work the way I sketched it above.

jaumeortola commented 3 years ago

Regarding the general class of errors like "paragrafo vs parágrafo": Spanish and Catalan have somewhat more complex and precise rules that could be adapted to Portuguese. The first step is to extract a list of all possible confusions. Instead of (unreadable) entities in XML, I would put them in a plain text file. Just to be sure, @marcoagpinto: the most usual combination is verb without diacritic vs noun/adj with diacritic, isn't it? Are there other possible confusions? The other way around (verbs with diacritic)?

jaumeortola commented 3 years ago

The rule PARONYM_ARVORE_0 could be generalized to cover parágrafo only with a very complex regular expression. So let's just put it in a separate rule. https://github.com/languagetool-org/languagetool/commit/c29f51e56bdd8ba9f94cde3e625396e3c901511d

marcoagpinto commented 3 years ago

> Just to be sure, @marcoagpinto: the most usual combination is verb without diacritic vs noun/adj with diacritic, isn't it? Are there other possible confusions? The other way around (verbs with diacritic)?

To be honest, right now I don't know the answer.

jaumeortola commented 3 years ago

These are the confusions I found (verb without diacritic vs adjective/noun with diacritic). Do they make sense to you, @marcoagpinto? verb-nomadj-pt.txt

marcoagpinto commented 3 years ago

@jaumeortola

I have taken a quick look at the file.

Many of the verbs there are unknown to me (maybe they are rarely used), but I recognise some.

Is there a plan for the file?

jaumeortola commented 3 years ago

> Is there a plan for the file?

The plan is to do the same as has been done in Catalan and Spanish: put the list in a file, which can then be used with a filter in different rules. Some rules will be completely safe (for example: preposition + *verb->noun/adjective). But other rules will need testing and exceptions.
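
The plan above can be sketched roughly as follows. This is an illustration in Python, not LanguageTool's actual filter mechanism: the `verb;noun` line format, the preposition list, and the function names are all assumptions made for the example.

```python
import io

# Hypothetical file content: one "verb form;noun/adjective form" pair per line.
SAMPLE = """\
# verb without diacritic;noun or adjective with diacritic
paragrafo;parágrafo
patio;pátio
"""

# Small, incomplete preposition list, for the demo only.
PREPOSITIONS = {"de", "em", "com", "por", "para", "sem", "sobre"}


def load_confusions(handle) -> dict[str, str]:
    """Read 'verb;nominal' pairs, skipping blank lines and '#' comments."""
    pairs = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        verb, nominal = line.split(";", 1)
        pairs[verb] = nominal
    return pairs


def suggest_after_preposition(tokens: list[str], pairs: dict[str, str]):
    """Return (index, suggestion) for each listed verb form right after a preposition."""
    hits = []
    for i in range(1, len(tokens)):
        if tokens[i - 1].lower() in PREPOSITIONS and tokens[i].lower() in pairs:
            hits.append((i, pairs[tokens[i].lower()]))
    return hits


pairs = load_confusions(io.StringIO(SAMPLE))
print(suggest_after_preposition("Leia sem paragrafo".split(), pairs))
```

Keeping the word list in a data file means several rules with different contexts can share it, and each context can be tested for false alarms separately.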

jaumeortola commented 3 years ago

@marcoagpinto I'd like to know your opinion about the results of DIACRITICS[1]. I have just fixed some false alarms, but you can probably spot more.

marcoagpinto commented 3 years ago

Hello @jaumeortola

There were tons of false positives in the nightly diff.

But the concept is the right approach.

marcoagpinto commented 3 years ago

@jaumeortola

What shall I do to fix them?

jaumeortola commented 3 years ago

> What shall I do to fix them?

You can see how I fixed some in my last two commits (adding multiwords, improving tags and disambiguation...). If you are not able to fix them, just list them here.

marcoagpinto commented 3 years ago

Ahhhh... my task for 5am.

At 5am I will download the latest nightly and see what I can do (with the remaining ones).

jaumeortola commented 3 years ago

There are potential problems with "nos" + noun/verb because it is ambiguous. Take a look at the nouns (masc. pl.) in the list, and see whether each is more likely a noun or a verb after "nos". We can refine the rule for this case if you find potential problems.

marcoagpinto commented 3 years ago

I have been working on it for around 2-3 hours now.

I am testing against a 600 000 corpus, so it will take some more time before I commit it.