languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[pt] Disambiguator gives wrong POS to verb - 2023-10-11 #9457

Closed marcoagpinto closed 10 months ago

marcoagpinto commented 11 months ago

Heya, @p-goulart and @susanaboatto

“toma” appears as a noun instead of a verb:

O crime toma formas impensáveis.

[Screenshot: Análise de Texto - LanguageTool, 2023-10-11 at 01-06-34]

Thanks!

p-goulart commented 11 months ago

DAN[1]: crime[crime/AQ0CS0,crime/NCMS000] -> crime[crime/AQ0CS0]
DET-NOUN_PRON-VERB[3]: crime[crime/AQ0CS0] -> crime[crime/AQ0CS0]
DAN[1]: toma[toma/NCCS000,tomar/VMIP3S0,tomar/VMM02S0] -> toma[toma/NCCS000]

We're aware that some disambiguator rules can be a little aggressive, but a lot of patterns may depend on them by now, so modifying them could be risky. Is there a specific rule that this disambiguator issue is breaking?

marcoagpinto commented 11 months ago

@p-goulart

Yes, I was improving an academic rule, and the wrong tagging breaks the rule on its new examples:

        <rule id='TOMAR_ASSUMIR' name="[Universitário][Científico] V. Tomar → V. Assumir" tone_tags="academic" is_goal_specific="true">
            <pattern>
                <token postag='SENT_START|AQ.+|NC.+|NP.+|CS|CC' postag_regexp='yes'/>
                <marker>
                    <token inflected='yes' regexp='yes'>tomar
                        <exception scope='previous' postag_regexp='yes' postag='V.+|PP.+'/>
                        <exception scope='previous' regexp='yes'>decis(ão|ões)</exception>
                    </token>
                </marker>
                <token min='0' max='2' postag='SPS00|(SPS00:)?[DP][ADIPRT].+|RG' postag_regexp='yes'/>
                <token regexp='yes'>cert[ao]s?|determinad[ao]s?|diferentes?|divers[ao]s?|enormes?|formas?|imens[ao]s?|inúmer[ao]s|múltipl[ao]s|vári[ao]s|variad[ao]s</token>
                <token postag='AQ.+|NC.+|PI.+' postag_regexp='yes'>
                    <exception regexp='yes' inflected='yes'>bebida|café|caneca|cerveja|chá|colher|copo|drink|frasco|garfo|garrafa|garrafão|xícara|shot|su[mc]o|vinho|gelado|sorvete|blíster|caixa|comprimido|contracetivo|embalagem|medicação|medicamento|pílula|remédio|autocarro|automóvel|avião|carrinha|carro|jato|ônibus|táxi|veículo|comboio|trem|voo|barc[ao]|bote|canoa|ferry|banho|duche</exception> <!-- Add more words as they are found -->
                </token>
            </pattern>
            <message>Num contexto formal/científico, é preferível escrever &quot;assumir&quot;.</message>
            <suggestion><match no='2' postag='V.+' postag_regexp='yes'>assumir</match></suggestion>
            <example correction="assume">O crime <marker>toma</marker> formas impensáveis.</example>
            <example correction="assume">O crime <marker>toma</marker> formas diversas.</example>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> diversas formas.</example>
            <example correction="assume">O crime é perigoso e o seu financiamento <marker>toma</marker> as mais diversas formas.</example>
        </rule>

marcoagpinto commented 11 months ago

Pedro,

Maybe you could improve the disambiguator only for the verb “tomar”?

That way the change would be less risky on a global scale.

p-goulart commented 11 months ago

Sorry, this isn't a priority right now. I can take care of this issue at some point later, but if you're keen on seeing this work, why don't you dive into the disambiguator? It's pretty much like editing rules.
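
For readers following along, here is a minimal sketch of what a verb-specific disambiguation rule could look like. It is not the fix that was eventually merged. The rule id, name, and context tokens are hypothetical, and the `inputform` in the example is copied from the trace above, so it assumes the rule runs before DAN (that is, that disambiguation rules apply in file order). It covers only the simplest case, where “forma(s)” directly follows the verb.

```xml
<!-- Hypothetical sketch: id, name, and context tokens are assumptions,
     not taken from the LanguageTool repository. -->
<rule id="TOMA_VERB_BEFORE_FORMAS" name="toma: keep verb readings before 'forma(s)'">
    <pattern>
        <!-- a preceding word that has a noun reading, e.g. "crime" -->
        <token postag="NC.+" postag_regexp="yes"/>
        <marker>
            <!-- any form whose lemma is "tomar" -->
            <token inflected="yes">tomar</token>
        </marker>
        <token regexp="yes">formas?</token>
    </pattern>
    <!-- "filter" keeps only the readings whose POS tag matches the regex -->
    <disambig action="filter" postag="V.+"/>
    <!-- inputform assumes the readings shown in the trace, before DAN applies -->
    <example type="ambiguous" inputform="toma[toma/NCCS000,tomar/VMIP3S0,tomar/VMM02S0]" outputform="toma[tomar/VMIP3S0,tomar/VMM02S0]">O crime <marker>toma</marker> formas impensáveis.</example>
</rule>
```

Because the pattern is anchored on the lemma “tomar” and on “forma(s)”, a rule like this would leave other words untouched, which keeps the change narrow compared with loosening DAN itself.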

marcoagpinto commented 11 months ago

Pedro, the last time I touched the disambiguator, several years ago, I “screwed” it up.

I would rather not risk it.

I will ask for the help of Jaume: @jaumeortola

Heya, Jaume, can you help?

Thanks!

p-goulart commented 11 months ago

The fact this didn’t work when you tried several years ago doesn’t mean it won’t work when you try a second time! Have a little faith in yourself ;)

I suggest you give it another go, add us as reviewers, and we’ll have a look later. Whatever broke last time won't break this time, because we'll be here to prevent anything catastrophic from happening.

marcoagpinto commented 11 months ago

Sure, I will try it tonight.

You are right, “my lack of faith is disturbing” (Star Wars).

marcoagpinto commented 11 months ago

https://github.com/languagetool-org/languagetool/pull/9461