Closed jaumeortola closed 7 months ago
Well, a lot of them seem strange to me.
Maybe Ricardo can answer?
Same question for nouns: nouns-augm.txt nouns-dim.txt
Currently, we have a LanguageTool rule (INFORMALITIES[3]) that recommends avoiding many of these augmentative and diminutive words, but without providing suggestions: <message>Linguagem informal. Considere as alternativas.</message> (in English: "Informal language. Consider the alternatives.")
Is this rule important? Does it make sense? Do we want to keep it as it is? What suggestion would be appropriate? The non-diminutive/non-augmentative word?
@jaumeortola
Yes, the rule seems good.
The informalities are in an ENTITY.
I have been adding words to it as I find them.
It is impossible to add suggestions to all of them (they are in an entity).
Just like the bad language ENTITY, it is impossible to add suggestions to all.
INFORMALITIES[3] is not using an ENTITY. It uses POS tags: <token postag_regexp='yes' postag='AQC.+|N.....[AD]'>
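To make the POS-based matching concrete, here is a minimal sketch in Python of which tags the pattern above would select. It assumes the pattern has to match the whole tag; the helper name `is_flagged` and the noun tags in the loop are made up for illustration and are not taken from the dictionary.

```python
import re

# Pattern copied verbatim from the rule's postag attribute.
INFORMAL_POSTAG = re.compile(r"AQC.+|N.....[AD]")


def is_flagged(postag: str) -> bool:
    """Return True if the whole tag matches the pattern (assumed full-match semantics)."""
    return INFORMAL_POSTAG.fullmatch(postag) is not None


# Illustrative tags only: the two adjective tags appear in this thread,
# the two noun tags are invented to fit the shape of the pattern.
for tag in ["AQCMS0", "AQ0MS0", "NCMS00D", "NCMS000"]:
    print(tag, is_flagged(tag))
```

So the rule depends entirely on the tagger marking the third position of adjective tags and the seventh position of noun tags.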
It would be possible to suggest the non-diminutive word if we keep the current tagging and change the rule a bit:
abaladinhos (AQCMS0) -> abalados (AQ0MS0)
But keeping the tagging requires some work. I am going to do it only if the rule is good and necessary.
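A rough sketch of how such a suggestion could be derived, in Python. Everything here is hypothetical: the two toy dictionaries stand in for the tagger and synthesizer data, the lemma "abalado" is an assumption, and the only thing taken from the example above is the swap of the third tag character from "C" to "0".

```python
from typing import Optional


def base_postag(postag: str) -> Optional[str]:
    """Map a diminutive/augmentative adjective tag to the plain adjective tag."""
    # Assumption: 'AQC...' marks the diminutive/augmentative, 'AQ0...' the plain form.
    if postag.startswith("AQC"):
        return "AQ0" + postag[3:]
    return None


# Toy data, for illustration only (tags copied from the example in the thread).
TOY_TAGGER = {"abaladinhos": ("abalado", "AQCMS0")}   # word -> (lemma, tag)
TOY_SYNTH = {("abalado", "AQ0MS0"): "abalados"}       # (lemma, tag) -> word


def suggest_plain_form(word: str) -> Optional[str]:
    """Return the non-diminutive form of a word, or None if it cannot be derived."""
    entry = TOY_TAGGER.get(word)
    if entry is None:
        return None
    lemma, postag = entry
    target = base_postag(postag)
    if target is None:
        return None
    return TOY_SYNTH.get((lemma, target))


print(suggest_plain_form("abaladinhos"))
```

The point of the sketch is that the suggestion is only available while both tags stay in the dictionary, which is the extra work mentioned above.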
We should avoid rules without suggestions. Most of the time they are useless. The words in the informal entity could be moved to a replacement rule with suggestions, e.g. estrambótico=estrambólico|esquisito|extravagante|ridículo|raro
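As an illustration of the word=alternative|alternative format in the example above, here is a small Python sketch that reads such lines into a lookup table. The function name `parse_replacements` and the handling of "#" comment lines are assumptions made for this example, not a description of how LanguageTool itself loads the file.

```python
from typing import Dict, Iterable, List


def parse_replacements(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'word=alt1|alt2|...' lines into a word -> suggestions table."""
    table: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and (assumed) '#' comment lines.
        if not line or line.startswith("#"):
            continue
        word, _, alternatives = line.partition("=")
        table[word.strip()] = [a.strip() for a in alternatives.split("|") if a.strip()]
    return table


sample = ["estrambótico=estrambólico|esquisito|extravagante|ridículo|raro"]
suggestions = parse_replacements(sample)
print(suggestions["estrambótico"])
```

With the words in this form, each informal word carries its own list of suggestions, which an entity-based rule cannot do.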
@jaumeortola I haven't been adding the "D" marker to the POS tags of diminutives.
That would require reviewing the whole added.txt file.
Also, the pejorative rule I created last week has only one suggestion, and its words are in an ENTITY.
Personally, I would keep the ENTITY for informalities, and if a particular word isn't considered an informality, we could add an exception to the rule.
The added.txt file will be reviewed anyway. I plan to move all these words to the new tagger dictionary.
ahhhh... good to know 🙂
Hi @jaumeortola and @marcoagpinto, today was an exceptionally busy day for me, and only now could I access my notebook to read the messages. Personally, I don't consider the diminutive/augmentative rule useful. Words like 'espertona' or 'abaladinho' would rarely show up in a formal text, and maybe the user really wants to use the word in the diminutive or augmentative for pragmatic purposes. What cases can you imagine of inappropriate diminutive or augmentative usage in a formal text (and would this happen in the real world)?
That was also my impression. If the rule is not essential, the tagger dictionary can be simplified a bit.
Are the words I posted here in several files correct? I have doubts about the diminutive/augmentative adjectives (file diminutives.txt in the first comment) because many of them are not accepted by the speller dictionaries.
All the diminutives in -ito seem strange to me: rare, with maybe only a few existing in current registers. The augm file has few words, and all of them look OK to me.
@jaumeortola
Can I add words to added.txt or should I wait for the new tagger?
"Mary" is marked as male and causes false positives in gender. "Palmeiras" needs to be NP male to avoid false positives in gender.
Thanks!
I would like to upgrade to the new tagger dictionary ASAP to avoid conflicts with your current work. Mary is fixed in the new dictionary because I also detected the problem. Anyway, you can add words to added.txt. I will take care of merging them into the new dictionary. With the new tagger, you will find many more words tagged, and also many more words with several tags, which might require disambiguation. This is a price we'll have to pay.
@susanaboatto @marcoagpinto to what degree is this still relevant? Diminutives and augmentatives should be generated by the .aff files, but their tagging is patchy at best. Should we go for a more productive (i.e. quasi-inflectional) approach in the tagger scripts as well?
@ricardojosehlima @marcoagpinto
In this repo, I am rebuilding the Portuguese tagger dictionary. It will have twice as many words as the previous dictionary.
This is a list of diminutive and augmentative adjectives: diminutives.txt Do they make sense in the tagger dictionary? Are they correct and usual words? Or are they automatically generated?
Most of them are not accepted by the spellers (PT, BR...).