languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

use Morfologik for Portuguese? #6079

Open danielnaber opened 2 years ago

danielnaber commented 2 years ago

Is there a reason not to use Morfologik for Portuguese? It's faster than hunspell, and most languages in LT use it (except those that have compounds, which I think Portuguese doesn't?).

@jaumeortola Do you have an opinion on this?

jaumeortola commented 2 years ago

Certainly, we'll be much better off with Morfologik. The only problem that comes to mind is the language varieties. We have to figure out how many spelling dictionaries we really need and what the differences are.

udomai commented 2 years ago

I think we might need two spelling dictionaries for pt-PT and pt-BR.

Some words might not be in use in pt-PT, like xícara (= chávena), others take different accents (metro vs. metrô, ténis vs. tênis). One tagging problem might be verbs ending in -ar in the pretérito perfeito simples, which end in -amos in pt-BR (-ámos in pt-PT), and are homographs of the present tense in pt-BR.
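One way to get a first picture of these differences is to compare the two word lists directly. The following is a minimal Python sketch, assuming each variant's dictionary has already been expanded into a plain set of words. The names `strip_accents` and `compare_variants` and the toy data are illustrative only and are not part of LanguageTool.

```python
import unicodedata


def strip_accents(word):
    """Return the word with combining diacritics removed (tênis -> tenis)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def compare_variants(pt_pt, pt_br):
    """Split two word sets into shared words, variant-only words,
    and pairs of variant-only words that differ only in diacritics."""
    shared = pt_pt & pt_br
    only_pt = pt_pt - pt_br
    only_br = pt_br - pt_pt
    # Index the pt-BR-only words by their accent-free form.
    br_by_base = {}
    for word in only_br:
        br_by_base.setdefault(strip_accents(word), []).append(word)
    accent_pairs = sorted(
        (pt_word, br_word)
        for pt_word in only_pt
        for br_word in br_by_base.get(strip_accents(pt_word), [])
    )
    return shared, only_pt, only_br, accent_pairs


# Toy data built from the examples in this comment (not real dictionary content).
pt_pt_words = {"casa", "chávena", "metro", "ténis"}
pt_br_words = {"casa", "xícara", "metrô", "tênis"}
shared, only_pt, only_br, accent_pairs = compare_variants(pt_pt_words, pt_br_words)
```

If most variant-only words turn out to be accent pairs, that would suggest the dictionaries overlap heavily. Purely lexical differences such as xícara vs. chávena would remain in `only_br` and `only_pt`.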

danielnaber commented 2 years ago

Maybe we should just use the same dicts we have now and export them: https://github.com/languagetool-org/languagetool/tree/9d3c36600f369cba03105343b4f0550a016e6cdf/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/hunspell

tiff commented 2 years ago

Related: https://github.com/languagetool-org/languagetool/issues/2082 https://github.com/languagetool-org/languagetool/issues/199

jaumeortola commented 2 years ago

Number of lines (words) in the different Hunspell dictionaries:

12,558,170 pt_PT (~5 million with enclitics: -nos, -vos, -te, -se...)
10,545,031 pt_BR (~7 million with enclitics: -te, -lhe, -lhes...)
10,914,777 pt_AO
10,004,463 pt_MZ

Lines in the tagger dictionary: 1,131,147
added.txt: 7067

The number and the distribution of enclitics are surprisingly different in PT and BR. Most common enclitics PT vs. BR:

count | pt_PT | count | pt_BR
501396 | -nos | 532194 | -te
99440 | -vos | 531876 | -lhe
99440 | -te | 531857 | -lhes
99440 | -se | 527201 | -vos
99440 | -me | 494055 | -nos
99440 | -lhos | 424264 | -me
99440 | -lho | 379598 | -se
99440 | -lhes | 331421 | -se-lhe
99440 | -lhe | 331407 | -se-lhes
99440 | -lhas | 254694 | -a
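Counts like these can be reproduced from an expanded word list with a few lines of Python. This is only a sketch, not the script that produced the table: it assumes one fully expanded form per entry, with enclitics attached by hyphens, and the `CLITICS` set and the function name `count_enclitics` are hypothetical and incomplete.

```python
from collections import Counter

# Assumed, incomplete set of clitic pronouns; extend as needed.
CLITICS = {
    "me", "te", "se", "nos", "vos", "lhe", "lhes",
    "lho", "lhos", "lha", "lhas",
    "o", "a", "os", "as", "lo", "la", "los", "las",
}


def count_enclitics(words):
    """Count enclitic tails such as '-te' or '-se-lhe' in a word list."""
    counts = Counter()
    for word in words:
        stem, sep, tail = word.strip().partition("-")
        if not sep or not stem:
            continue  # no hyphen, or nothing before it
        # Only count tails made up entirely of known clitics, so that
        # ordinary hyphenated compounds are not counted by mistake.
        if all(part in CLITICS for part in tail.split("-")):
            counts["-" + tail] += 1
    return counts


# Toy input; a real run would iterate over the expanded dictionary file.
sample_counts = count_enclitics(["dá-se-lhe", "dá-te", "diz-te", "guarda-chuva", "casa"])
```

Calling `sample_counts.most_common(10)` gives a ranking in the same shape as the table above.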

In pt_PT there are millions of forms with prefixes (and suffixes) that don't make much sense. See:

acometo
antiacometo
reacometo
biacometo
triacometo
tetraacometo
pentaacometo
hexaacometo
cometo
anticometo
recometo
bicometo
tricometo
tetracometo
pentacometo
hexacometo

For example, one and a half million words (the whole dictionary?) with tetra-:

tetraxenotransplantes
tetraxenotransplantíssimo
tetraxenotransplantíssima
tetraxenotransplantíssimos
tetraxenotransplantíssimas
tetraxenotransplantice
tetraxenotransplantices
tetraxenotransplante
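To estimate how many entries come from this kind of mechanical prefixing, one can check which words are a known prefix followed by another word in the same list. A rough Python sketch, assuming the dictionary is already expanded to plain words; the prefix list is taken only from the examples above, and `count_prefixed_forms` is a hypothetical helper:

```python
from collections import Counter

# Prefixes seen in the examples above; the real affix file may define more.
PREFIXES = ("anti", "re", "bi", "tri", "tetra", "penta", "hexa")


def count_prefixed_forms(words, prefixes=PREFIXES):
    """Count, per prefix, the words that equal prefix + another listed word."""
    word_set = set(words)
    counts = Counter()
    for word in word_set:
        for prefix in prefixes:
            if word.startswith(prefix) and word[len(prefix):] in word_set:
                counts[prefix] += 1
    return counts


sample = [
    "acometo", "antiacometo", "reacometo", "biacometo", "triacometo",
    "tetraacometo", "pentaacometo", "hexaacometo",
    "cometo", "anticometo", "recometo", "bicometo", "tricometo",
    "tetracometo", "pentacometo", "hexacometo",
]
prefix_counts = count_prefixed_forms(sample)
```

This check is purely string-based, so a legitimate word that happens to start with one of these prefixes would also be counted. The result is an upper bound on generated forms, not an exact figure.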

Probably, most of these features (prefixes, suffixes, enclitics...) are not being used currently in LT.

udomai commented 2 years ago

That looks interesting! Something is probably wrong with the enclitic counts for pt-PT vs. pt-BR. The numbers should be higher in pt-PT, since in pt-BR the postponed object pronoun is far less frequent in all but the highest registers.

jaumeortola commented 2 years ago

I have been looking into some problems in the Hunspell Portuguese dictionaries: https://github.com/languagetool-org/languagetool/issues/6298

BTW, in today's and yesterday's nightly diffs there are some unexpected changes: many spelling suggestions have changed, and it is not clear why. There are some changes in German as well, but not so many. Any idea about this, @danielnaber?

https://internal1.languagetool.org/regression-tests/via-http/2022-09-01/pt-BR/result_java_HUNSPELL_RULE.html
https://internal1.languagetool.org/regression-tests/via-http/2022-09-02/pt-BR/result_java_HUNSPELL_RULE.html

Instead of trying to solve these problems, we should convert the spelling dictionaries to Morfologik. Possible obstacles:

I would need several days (a whole week?) to do it, with the support of @susanaboatto. I can start a branch and see if it is doable in a reasonable amount of time.

danielnaber commented 2 years ago

The changes in the German speller in today's diff are simply because words have been added to spelling.txt, I think.

susanaboatto commented 2 years ago

I wonder if that's also the reason we have changes in the PT speller. I have been editing spelling.txt thoroughly this week.