languagetool-org / portuguese-pos-dict

Portuguese POS tagger
GNU Lesser General Public License v2.1

diminutives #1

Closed jaumeortola closed 7 months ago

jaumeortola commented 2 years ago

@ricardojosehlima @marcoagpinto

In this repo, I am rebuilding the Portuguese tagger dictionary. It will have twice as many words as the previous dictionary.

This is a list of diminutive and augmentative adjectives: diminutives.txt. Do they make sense in the tagger dictionary? Are they correct, commonly used words? Or are they automatically generated?

Most of them are not accepted by the spellers (PT, BR...).

abalado AQCFP0 abaladazinhas
abalado AQCFP0 abaladazitas
abalado AQCFP0 abaladinhas
abalado AQCFP0 abaladitas
abalado AQCFS0 abaladazinha
abalado AQCFS0 abaladazita
abalado AQCFS0 abaladinha
abalado AQCFS0 abaladita
abalado AQCMP0 abaladinhos
abalado AQCMP0 abaladitos
abalado AQCMP0 abaladozinhos
abalado AQCMP0 abaladozitos
abalado AQCMS0 abaladinho
abalado AQCMS0 abaladito
abalado AQCMS0 abaladozinho
abalado AQCMS0 abaladozito
esperto AQAFP0 espertaças
esperto AQAFP0 espertonas
esperto AQAFS0 espertaça
esperto AQAFS0 espertona
esperto AQAMP0 espertaços
esperto AQAMP0 espertões
esperto AQAMS0 espertaço
esperto AQAMS0 espertão
marcoagpinto commented 2 years ago

Well, a lot of them seem strange to me.

Maybe Ricardo can answer?

jaumeortola commented 2 years ago

Same question for nouns: nouns-augm.txt, nouns-dim.txt

jaumeortola commented 2 years ago

Currently, we have a LanguageTool rule (INFORMALITIES[3]) that recommends avoiding many of these augmentative and diminutive words, but without providing suggestions: <message>Linguagem informal. Considere as alternativas.</message> Is this rule important? Does it make sense? Do we want to keep it as it is? What suggestion would be appropriate? The non-diminutive/non-augmentative word?

marcoagpinto commented 2 years ago

@jaumeortola

Yes, the rule seems good.

The informalities are in an ENTITY.

I have been adding words to it as I find them.

marcoagpinto commented 2 years ago

It is impossible to add suggestions to all of them (they are in an entity).

marcoagpinto commented 2 years ago

Just like the bad language ENTITY, it is impossible to add suggestions to all.

jaumeortola commented 2 years ago

INFORMALITIES[3] is not using an ENTITY. It uses POS tags: <token postag_regexp='yes' postag='AQC.+|N.....[AD]'> It would be possible to suggest the non-diminutive word if we keep the current tagging and change the rule a bit: abaladinhos (AQCMP0) -> abalados (AQ0MP0). But keeping the tagging requires some work. I am going to do it only if the rule is good and necessary.
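To make the idea concrete, here is a minimal Python sketch of the tag-based lookup, not the actual LanguageTool implementation. It assumes the postag pattern has to match the whole tag, that the neutral adjective tag is obtained by swapping the third character (AQC... becomes AQ0...), and that a lemma/tag table is available. The helper names and the toy `FORMS` table are hypothetical, and the example uses the tag listed for abaladinhos in the first comment (AQCMP0).

```python
import re

# Pattern copied from the rule's postag attribute.
# Assumption: the pattern must match the entire tag, hence fullmatch().
INFORMAL_POSTAG = re.compile(r"AQC.+|N.....[AD]")


def is_flagged(postag: str) -> bool:
    """Return True if the tag would be caught by the rule's pattern."""
    return INFORMAL_POSTAG.fullmatch(postag) is not None


def neutral_postag(postag: str) -> str:
    """Hypothetical helper: map a diminutive adjective tag to its
    neutral counterpart, e.g. AQCMP0 -> AQ0MP0."""
    if postag.startswith("AQC"):
        return "AQ0" + postag[3:]
    return postag


# Toy (lemma, tag) -> form table standing in for the tagger dictionary.
FORMS = {
    ("abalado", "AQ0MS0"): "abalado",
    ("abalado", "AQ0MP0"): "abalados",
}


def suggest(lemma: str, postag: str):
    """Look up the non-diminutive form with the same gender and number."""
    return FORMS.get((lemma, neutral_postag(postag)))


print(is_flagged("AQCMP0"))          # True
print(suggest("abalado", "AQCMP0"))  # abalados
```

Note that the pattern as written only covers diminutive adjectives (AQC); the augmentative adjective tags in the list above (AQA...) would not match it.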

We should avoid rules without suggestions. Most of the time they are useless. The words in the entity informal could be moved to a replacement rule with suggestions. Ex. estrambótico=estrambólico|esquisito|extravagante|ridículo|raro.
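As an illustration only, a line in that word=suggestion|suggestion format could be read like this in Python. The function name is hypothetical, and the sketch assumes one entry per line with `=` separating the flagged word from its `|`-separated suggestions.

```python
def parse_replacement(line: str):
    """Split 'word=sugg1|sugg2|...' into the word and its suggestions."""
    word, _, rest = line.strip().partition("=")
    suggestions = [s for s in rest.split("|") if s]
    return word, suggestions


word, suggestions = parse_replacement(
    "estrambótico=estrambólico|esquisito|extravagante|ridículo|raro"
)
print(word)              # estrambótico
print(len(suggestions))  # 5
```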

marcoagpinto commented 2 years ago

@jaumeortola I haven't been adding POS of diminutives with "D".

That would require reviewing the whole added.txt file.

Also, the pejorative rule I created last week has only one suggestion, and its words are in an ENTITY.

marcoagpinto commented 2 years ago

Personally, I would keep the ENTITY for informalities; if a particular word isn't considered an informality, we could add an exception to the rule.

jaumeortola commented 2 years ago

> That would require reviewing the whole added.txt file.

The added.txt file will be reviewed anyway. I plan to move all these words to the new tagger dictionary.

marcoagpinto commented 2 years ago

> The added.txt file will be reviewed anyway. I plan to move all these words to the new tagger dictionary.

ahhhh... good to know 🙂

ricardojosehlima commented 2 years ago

Hi @jaumeortola and @marcoagpinto, today was an exceptionally busy day for me, and only now could I access my notebook to read the messages. Personally, I don't consider the diminutive/augmentative rule useful. Words like 'espertona' or 'abaladinho' would rarely show up in a formal text, and maybe the user really wants to use the diminutive or augmentative form for pragmatic purposes. What cases of inappropriate diminutive/augmentative usage in a formal text do you have in mind (and would they happen in the real world)?

jaumeortola commented 2 years ago

> Personally, I don't consider the diminutive augmentative rule useful.

That was also my impression. If the rule is not essential, the tagger dictionary can be simplified a bit.

Are the words I posted here in several files correct? I have doubts about the diminutive/augmentative adjectives (file diminutives.txt in the first comment) because many of them are not accepted by the speller dictionaries.

ricardojosehlima commented 2 years ago

All the diminutives in -ito seem strange to me: they are rare, and maybe only a few exist in current usage. The augm file has few words, and all of them look fine to me.

marcoagpinto commented 2 years ago

@jaumeortola

Can I add words to added.txt or should I wait for the new tagger?

"Mary" is marked as male and causes false positives in gender. "Palmeiras" needs to be NP male to avoid false positives in gender.

Thanks!

jaumeortola commented 2 years ago

I would like to upgrade to the new tagger dictionary ASAP to avoid conflicts with your current work. Mary is fixed in the new dictionary because I also detected the problem. Anyway, you can add words to added.txt. I will take care of merging them into the new dictionary. With the new tagger, you will find many more words tagged, and also many more words tagged with several tags, which might require disambiguation. This is a price we'll have to pay.

p-goulart commented 7 months ago

@susanaboatto @marcoagpinto to what degree is this still relevant? Diminutives and augmentatives should be generated by the .aff files, but their tagging is patchy at best. Should we go for a more productive (i.e. quasi-inflectional) approach in the tagger scripts as well?
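For the sake of discussion, a productive approach could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not what the tagger scripts currently do. It handles only regular adjectives whose lemma ends in -o, uses only the -inho series, and skips lemmas ending in -ão, -co or -go, which need spelling changes. The tag pattern (AQC + gender + number + 0) is taken from the list in the first comment; the function and table names are hypothetical.

```python
# Endings for the -inho series, keyed by the diminutive adjective tag.
DIMINUTIVE_ENDINGS = {
    "AQCMS0": "inho",
    "AQCMP0": "inhos",
    "AQCFS0": "inha",
    "AQCFP0": "inhas",
}

# Lemma endings this sketch deliberately does not handle.
SKIPPED_ENDINGS = ("ão", "co", "go")


def generate_diminutives(lemma: str):
    """Return (lemma, tag, form) triples for a regular adjective in -o."""
    if not lemma.endswith("o") or lemma.endswith(SKIPPED_ENDINGS):
        return []
    stem = lemma[:-1]
    return [(lemma, tag, stem + ending)
            for tag, ending in DIMINUTIVE_ENDINGS.items()]


for lemma, tag, form in generate_diminutives("abalado"):
    # Same column order as the list in the first comment.
    print(lemma, tag, form)
```

Whether forms generated this way should be tagged at all is the open question from earlier in the thread, since many of them are not accepted by the spellers.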