languagetool-org / portuguese-pos-dict

Portuguese POS tagger
GNU Lesser General Public License v2.1

Differences in spelling: Brazil/Portugal #4

Closed: jaumeortola closed 8 months ago

jaumeortola commented 2 years ago

I would like to have a clear idea of the main differences in spelling and the scope of these differences.

Portugal: receção, facto/fato (depends on the meaning), contacto, óptimo, acção, ténis, tónica, académico, demónio, António

Brazil: recepção, fato, contato, ótimo, ação, tênis, tônica, acadêmico, demônio, Antônio

What other examples can we find?

Documentation about the Acordo Ortográfico at Priberam:
http://www.priberam.pt/docs/CriteriosFLiPAO.pdf
http://www.priberam.pt/docs/AcOrtog90.pdf

jaumeortola commented 2 years ago

This is what I observed:

Contato/Contacto:
BR Michaelis: priority to contato
PT Priberam/PortoEditora: priority to contacto

Recepção/Receção:
BR Michaelis: only recepção
PT Priberam/PortoEditora: priority to receção, recepção marked as BR.
[There are 36 words with -pção; some, but not all, have double forms in -ção. The PT/BR preferences don't seem consistent.]
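A minimal sketch of how the -pção/-ção double forms could be listed automatically. It assumes a plain word list is available as an iterable of strings; the function name `pcao_doublets` and the three sample words are illustrative only and are not taken from the repository.

```python
import unicodedata


def pcao_doublets(words):
    """Return (pção_form, ção_form) pairs where both spellings are in `words`."""
    # Normalise to NFC so that "ç" and "ã" are single code points and the
    # suffix slicing below removes exactly the four characters of "pção".
    known = {unicodedata.normalize("NFC", w) for w in words}
    suffix = unicodedata.normalize("NFC", "pção")
    pairs = []
    for word in known:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + suffix[1:]  # drop only the "p"
            if candidate in known:
                pairs.append((word, candidate))
    return sorted(pairs)


# Illustrative input only: "opção" has no "-ção" counterpart in this list.
sample = ["recepção", "receção", "opção"]
doublets = pcao_doublets(sample)
```

Running this over a merged PT+BR word list would give the candidates to check by hand against Michaelis and Priberam.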

@marcoagpinto @ricardojosehlima Is this correct? What other differences are there in spelling?

ricardojosehlima commented 2 years ago

@jaumeortola this is correct. The vacillation is due to the last orthographic agreement: although it was successful in cases like removing the diaeresis (trema) from u (lingüística --> linguística), it wasn't in others, and then, as far as I know, use of the new forms was made optional. Another case, which seems to fit what I said above, is the acute accent in the first-person plural past form of first-conjugation verbs: it was obligatory in Portugal ("Ontem amámos essas pessoas") and is now optional there, whereas in Brazil it is never used. My LT already flags "amámos" as wrong, but I don't know what its status is in pt.

jaumeortola commented 2 years ago

Thanks. Now I will try to collect all these differences automatically. I will add a special tag for "amámos" so that it doesn't appear in suggestions when it is not explicitly desired.

The big questions we'll have to answer next are:

  • Can we completely 'unify' the spelling and tagger dictionary? I'd prefer having one dictionary for both tasks. It is easier to maintain, but it can have some disadvantages.

  • How to handle PT/BR differences. I think it is feasible to do it with just one source dictionary plus several lists of equivalences.

marcoagpinto commented 2 years ago

Hello @jaumeortola

I believe the tagger dictionary should contain all words, both PT and BR, since spelling mistakes are reported by the spelling dictionary, not the POS one.

About unifying the spelling dictionary, it would be great if the people in charge of the Hunspell dictionaries could add the words in spelling.txt to them.

Also, could there be: spelling_PT.txt, spelling_PT_PT.txt, spelling_PT_BR.txt?

This would make spelling more powerful.

Also, could the spelling work like Hunspell? A word listed in lowercase could be used in lowercase, capitalised, or in all uppercase, while a word listed with an initial uppercase letter could only be used with that uppercase letter. This would remove the need in some files to list a word in both lowercase and uppercase, removing redundancy.
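A rough sketch of the case rule proposed here, written as a plain function rather than against any real Hunspell or LanguageTool API. The name `accepts` is hypothetical, and accepting the all-uppercase form for every entry is an assumption about the desired behaviour.

```python
def accepts(entry: str, token: str) -> bool:
    """Return True if `token` is an allowed casing of dictionary `entry`."""
    # The exact form listed in the dictionary is always accepted.
    if token == entry:
        return True
    # Assumption: the all-uppercase form is accepted for any entry.
    if token == entry.upper():
        return True
    # A lowercase entry may also appear capitalised (e.g. sentence-initial).
    if entry == entry.lower() and token == entry[:1].upper() + entry[1:]:
        return True
    # Anything else, including lowercasing a capitalised entry, is rejected.
    return False
```

With this rule a file would need only `ténis` to cover `ténis`, `Ténis` and `TÉNIS`, and only `António` to cover `António` and `ANTÓNIO` while still rejecting `antónio`.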

Thanks!

marcoagpinto commented 2 years ago

Also, if the tagger and spelling dictionaries are unified, how would the words be imported into LibreOffice?

jaumeortola commented 2 years ago

I collected PT/BR differences in this folder: https://github.com/languagetool-org/portuguese-pos-dict/tree/main/variants

Some words need to be checked and classified manually because they are not country variants but different words with different meanings (e.g. captar/catar). I will write a script to check them in the PT/BR Hunspell dictionaries, or even on some on-line dictionaries. That will help in the manual revision.
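A minimal sketch of what such a check could look like, assuming the PT and BR dictionaries have already been expanded into plain sets of word forms. The function name `classify_pair`, the three labels, and the tiny sample sets are illustrative assumptions, not the script or data actually used in the repository.

```python
def classify_pair(pt_form, br_form, pt_words, br_words):
    """Classify a candidate PT/BR pair by looking each form up in the other list."""
    pt_in_br = pt_form in br_words
    br_in_pt = br_form in pt_words
    if not pt_in_br and not br_in_pt:
        # Each spelling is exclusive to its own dictionary: likely a true variant.
        return "variant"
    if pt_in_br and br_in_pt:
        # Both spellings are valid in both countries: possibly two different
        # words (like captar/catar), so a human should look at the pair.
        return "needs_review"
    # Only one side accepts the other's spelling: a preference, not a strict split.
    return "partial"


# Toy word sets for illustration only.
pt_words = {"receção", "captar", "catar"}
br_words = {"recepção", "captar", "catar"}

labels = {
    pair: classify_pair(pair[0], pair[1], pt_words, br_words)
    for pair in [("receção", "recepção"), ("captar", "catar")]
}
```

Pairs labelled `needs_review` or `partial` would then go to the manual revision.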

ricardojosehlima commented 2 years ago

@jaumeortola and @marcoagpinto these differences collected by Jaume are very interesting and will require some revision. For example, in the excluded file there are words ending in a diacritic that I'm not sure change in Portugal: is cocô cocó? Is ioiô ioió? And so on.

p-goulart commented 8 months ago

Most of these are taken care of in the new logic, incl. dialect_alternations.txt. If something here is still an issue, I suggest we open a new thread with fresh data.