Closed by jaumeortola 8 months ago
This is what I observed:
Contato/Contacto:
BR Michaelis: priority to contato
PT Priberam/PortoEditora: priority to contacto
Recepção/Receção:
BR Michaelis: only recepção
PT Priberam/PortoEditora: priority to receção; recepção marked as BR.
[There are 36 words with -pção; some, but not all, have double forms in -ção. The PT/BR preferences don't seem consistent.]
@marcoagpinto @ricardojosehlima Is this correct? What other differences are there in spelling?
@jaumeortola this is correct. The vacillation is due to the last orthographic agreement: although it succeeded in cases like removing the diaeresis from u (lingüística --> linguística), it did not in others, and then, as far as I know, use of the new forms was made optional. Another case that seems to fit what I said above is the acute accent in the first-person plural past form of first-conjugation verbs, which was obligatory in Portugal ("Ontem amámos essas pessoas") and is now optional there, whereas in Brazil it is never used. My LT already flags "amámos" as wrong, but I don't know what its status is in pt.
Thanks. Now I will try to collect all these differences automatically. I will add a special tag for "amámos" so that it doesn't appear in suggestions when it is not explicitly desired.
The big questions we'll have to answer next are:
Can we completely 'unify' the spelling and tagger dictionary? I'd prefer having one dictionary for both tasks. It is easier to maintain, but it can have some disadvantages.
How do we handle PT/BR differences? I think it is feasible to do it with just one source dictionary plus several lists of equivalences.
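To make the "one source dictionary plus lists of equivalences" idea concrete, here is a minimal sketch. The names `BASE_WORDS`, `PT_TO_BR` and `build_variant_dictionaries` are hypothetical, and the word pairs are taken from examples mentioned in this thread. It treats each pair as strictly one form per country, which is a simplification: as noted above, some dictionaries accept both forms and only give one of them priority.

```python
# Hypothetical data: words shared by both variants, plus a PT -> BR
# equivalence list. None of these names come from the actual repository.
BASE_WORDS = {"casa", "livro", "linguística"}
PT_TO_BR = {
    "contacto": "contato",
    "receção": "recepção",
    "ténis": "tênis",
    "académico": "acadêmico",
}


def build_variant_dictionaries(base_words, pt_to_br):
    """Derive per-country word sets from one source plus equivalences."""
    # PT gets the shared words plus the left-hand side of each pair.
    pt_words = set(base_words) | set(pt_to_br.keys())
    # BR gets the shared words plus the right-hand side of each pair.
    br_words = set(base_words) | set(pt_to_br.values())
    return pt_words, br_words


pt_words, br_words = build_variant_dictionaries(BASE_WORDS, PT_TO_BR)
```

With this layout, only the equivalence list needs to be maintained per country; the shared words live in one place.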
Hello @jaumeortola
I believe the tagger dictionary should contain all words, both PT and BR, since spelling mistakes are reported by the word dictionary, not the POS one.
About unifying the spelling dictionary, it would be great if the people in charge of the Hunspell dictionaries could add the words in spelling.txt to them.
Also, could there be: spelling_PT.txt, spelling_PT_PT.txt, spelling_PT_BR.txt?
This would make spelling more powerful.
Also, could the spelling work like Hunspell? A word listed in lowercase could be used in lowercase, capitalised, or in all uppercase, while a word listed with an initial uppercase letter could not be used in lowercase. This would remove the need in some files to list the same word in both lowercase and uppercase, removing redundancy.
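The casing rule proposed here can be sketched as follows. This is a simplified reading of the comment, not Hunspell's actual implementation; the function name `accepts` is hypothetical, and accepting the all-uppercase form of a capitalised entry is an assumption.

```python
def accepts(dictionary_entry: str, token: str) -> bool:
    """Hunspell-like casing check (simplified, hypothetical helper)."""
    # The exact form listed in the dictionary is always accepted.
    if token == dictionary_entry:
        return True
    # Assumption: the all-uppercase form is accepted for any entry.
    if token == dictionary_entry.upper():
        return True
    # A lowercase entry may also appear capitalised (e.g. sentence-initial).
    if dictionary_entry.islower():
        return token == dictionary_entry.capitalize()
    # A capitalised entry such as a proper noun is not accepted in lowercase.
    return False
```

Under this rule, a single lowercase entry such as "casa" covers "casa", "Casa" and "CASA", while an entry such as "Lisboa" rejects "lisboa".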
Thanks!
Also, if the tagger and spelling dictionaries are unified, how would the words be imported into LibreOffice?
I collected PT/BR differences in this folder: https://github.com/languagetool-org/portuguese-pos-dict/tree/main/variants
Some words need to be checked and classified manually because they are not country variants but different words with different meanings (e.g. captar/catar). I will write a script to check them in the PT/BR Hunspell dictionaries, or even on some on-line dictionaries. That will help in the manual revision.
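A rough sketch of such a check, assuming the PT and BR dictionaries have already been expanded into plain word sets (real Hunspell .dic files would first need their affix rules applied). The function name `classify_pair` and the sample word sets are hypothetical.

```python
def classify_pair(pt_form, br_form, pt_words, br_words):
    """Label a candidate pair as a country variant or as needing review."""
    # A clean variant pair: each form appears only in its own dictionary.
    pt_only = pt_form in pt_words and pt_form not in br_words
    br_only = br_form in br_words and br_form not in pt_words
    return "variant" if pt_only and br_only else "needs_review"


# Hypothetical sample data, not taken from the real dictionaries.
pt_words = {"ténis", "captar", "catar"}
br_words = {"tênis", "captar", "catar"}

tenis_label = classify_pair("ténis", "tênis", pt_words, br_words)
captar_label = classify_pair("captar", "catar", pt_words, br_words)
```

Pairs like captar/catar, where both forms exist in both dictionaries, fall into "needs_review" and are left for manual classification.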
@jaumeortola and @marcoagpinto the differences collected by Jaume are very interesting and will require some revision. For example, in the excluded file there are words ending with a diacritic that I'm not sure change in Portugal: is cocô cocó? Is ioiô ioió? And so on.
Most of these are taken care of in the new logic, incl. dialect_alternations.txt. If something here is still an issue, I suggest we open a new thread with fresh data.
I would like to have a clear idea of the main differences in spelling and the scope of these differences.
Portugal: receção, facto/fato (depends on the meaning), contacto, óptimo, acção, ténis, tónica, académico, demónio, António
Brazil: recepção, fato, contato, ótimo, ação, tênis, tônica, acadêmico, demônio, Antônio
What other examples can we find?
Documentation about the Acordo Ortográfico at Priberam:
http://www.priberam.pt/docs/CriteriosFLiPAO.pdf
http://www.priberam.pt/docs/AcOrtog90.pdf