languagetool-org / portuguese-pos-dict

Portuguese POS tagger
GNU Lesser General Public License v2.1

[pt-PT] Improved .AFF files for both AO45 and AO90 #11

Closed marcoagpinto closed 7 months ago

marcoagpinto commented 7 months ago

Heya @susanaboatto

The AFF changes added around 1600 verbal forms to pt-PT:

AO45: 3.PTPT_45_new_verbs.txt

AO90: 6.PTPT_90_new_verbs.txt

p-goulart commented 7 months ago

Is this all you need for pt-PT? Or should we be expecting more additions?

marcoagpinto commented 7 months ago

Heya, @p-goulart

For now, that is all I have changed in the .aff.

I will add words to the .dic in the future, and I still need to compare the PT-PT wordlist with the PT-BR one, as I mentioned before.

marcoagpinto commented 7 months ago

This change adds around 1600 verb forms to pt-PT, which will stop many valid words from being flagged as typos while writing text.

p-goulart commented 7 months ago

Sure, but the most important question is whether it outputs largely the same forms as those output by the PoS tagger. Adding forms with mo-l[oa]s? to one suffix flag may not add that much coverage.
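To make the coverage point concrete, here is a minimal sketch that checks which of the forms mentioned in this thread the pattern `mo-l[oa]s?` actually matches. The helper name `split_by_pattern` is made up for illustration; the form list is taken from the comments in this PR.

```python
import re

# The pattern quoted in the comment above: first-person plural forms
# ending in "mo" followed by the clitics -lo, -la, -los or -las.
MO_L_PATTERN = re.compile(r"mo-l[oa]s?$")


def split_by_pattern(forms):
    """Return (matching, non_matching) forms, preserving input order."""
    matching = [f for f in forms if MO_L_PATTERN.search(f)]
    non_matching = [f for f in forms if not MO_L_PATTERN.search(f)]
    return matching, non_matching


# Forms that appear in this thread.
forms = [
    "abolimo-la", "abolimo-las", "abolimo-lo", "abolimo-los",
    "ama-te", "ama-se", "ame-se", "ama-o",
]
covered, uncovered = split_by_pattern(forms)
print("covered:  ", covered)
print("uncovered:", uncovered)
```

The four `abolimo-l…` forms match, while the other enclitic forms do not, so a rule limited to this pattern leaves them uncovered.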

marcoagpinto commented 7 months ago

I don't understand what you mean.

It adds the words I listed in the first comment:

AO45: 3.PTPT_45_new_verbs.txt

AO90: 6.PTPT_90_new_verbs.txt

marcoagpinto commented 7 months ago

Previously they would appear as typos; now that should no longer happen.

p-goulart commented 7 months ago

Outputting new forms is good, but the important thing for the work we are doing now is making sure that pt-PT verb forms are the same as those output by the PoS tagger dictionary.

As we discussed in the past, we are changing our tagger to include enclitic pronouns as a part of the verb forms. The string ama-te will be a single verb form, ama-te, tagged V$some_tags:PP$some_tags.

The Hunspell .aff files must output the same forms. Otherwise, there will be a discrepancy between the speller's verb forms and the tagger's, which will cause inconsistencies in tagging and spellchecking.
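One way to check for such a discrepancy is a plain set comparison between the two form lists. This is only a sketch: `compare_forms` is a hypothetical helper, and the two short lists below are made-up stand-ins for the real speller and tagger output, which would normally be read from files.

```python
def compare_forms(speller_forms, tagger_forms):
    """Report forms present in one list but absent from the other."""
    speller, tagger = set(speller_forms), set(tagger_forms)
    return {
        "missing_from_speller": sorted(tagger - speller),
        "missing_from_tagger": sorted(speller - tagger),
    }


# Illustrative data only; not the actual dictionary contents.
speller_forms = ["ama", "amas", "abolimo-la"]
tagger_forms = ["ama", "amas", "ama-te"]

report = compare_forms(speller_forms, tagger_forms)
for label, forms in report.items():
    print(label, forms)
```

An empty `missing_from_speller` list would mean every tagger form is also accepted by the speller, which is the condition described above.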

marcoagpinto commented 7 months ago
abolimo-la
abolimo-las
abolimo-lo
abolimo-los

[Screenshot: LanguageTool text analysis ("Análise de Texto"), 2024-01-29]

I thought @susanaboatto was working on it?

For example, should abolimo-la appear in the future as VMIP1P0X:PP3FSA00?
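For readers unfamiliar with the notation, the tag in this question can be read as a verb tag and a pronoun tag joined by a colon, following the `V…:PP…` shape described earlier in the thread. The sketch below only splits at the colon; `CompoundTag` and `split_compound_tag` are hypothetical names, and the meaning of the individual tag characters is not assumed here.

```python
from typing import NamedTuple, Optional


class CompoundTag(NamedTuple):
    verb: str               # the part before the colon
    pronoun: Optional[str]  # the part after the colon, if any


def split_compound_tag(tag: str) -> CompoundTag:
    """Split 'VERB:PRONOUN' into its two parts; a bare verb tag has no pronoun."""
    verb, _, pronoun = tag.partition(":")
    return CompoundTag(verb, pronoun or None)


print(split_compound_tag("VMIP1P0X:PP3FSA00"))
print(split_compound_tag("VMIP1P0X"))
```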

p-goulart commented 7 months ago

Yes, we are working on it, on the branch that I've just changed this PR to point to. Those two things must happen in parallel.

marcoagpinto commented 7 months ago

What shall I do then?

The words my patch adds are valid, but I don't know how to produce a tag like VMIP1P0X:PP3FSA00.

Susana is the right person to help with that.

marcoagpinto commented 7 months ago

In simple words,

abolimo-la
abolimo-las
abolimo-lo
abolimo-los

will no longer appear as typos, but they won't show tags such as VMIP1P0X:PP3FSA00.

marcoagpinto commented 7 months ago

@p-goulart Will you take care of this?

Right now, I can't focus on more flags for the .aff.

p-goulart commented 7 months ago

The words here are fine in and of themselves, I'm just pointing out that they are not all we need.

If I run a simple unmunch test on a basic pt-PT verb like amar/XYPL, there are a bunch of forms I don't get, e.g.:

ama-te
ama-se
ame-se
ama-o

etc.

You don't need to do anything with the PoS tags. The only thing that is required is for the pt-PT speller scripts to output the same forms as the PoS tagger scripts.
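As a rough illustration of why a form only appears in the output when some suffix rule produces it, here is a toy expander in the spirit of Hunspell's suffix rules (strip, add, condition). It is not the real unmunch tool: the flag `Z` and its rules are hypothetical and are not the actual pt-PT flags, flags are assumed to be single characters, and prefixes and cross-products are ignored.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the stem ("0" = none)
    add: str        # characters appended after stripping
    condition: str  # pattern the end of the stem must match ("." = anything)


# HYPOTHETICAL rules for illustration only; not taken from the pt-PT .aff.
RULES = {
    "Z": [
        SuffixRule("r", "-te", "ar"),    # amar -> ama-te
        SuffixRule("r", "-se", "ar"),    # amar -> ama-se
        SuffixRule("r", "-o", "ar"),     # amar -> ama-o
        SuffixRule("ar", "e-se", "ar"),  # amar -> ame-se
    ],
}


def expand(entry, rules=RULES):
    """Expand a 'stem/FLAGS' entry into the set of forms its flags generate."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    for flag in flags:
        for rule in rules.get(flag, []):
            if not re.search(f"(?:{rule.condition})$", stem):
                continue  # condition not met, rule does not apply
            if rule.strip != "0" and not stem.endswith(rule.strip):
                continue  # nothing to strip, rule does not apply
            base = stem if rule.strip == "0" else stem[: len(stem) - len(rule.strip)]
            forms.add(base + rule.add)
    return forms


required = {"ama-te", "ama-se", "ame-se", "ama-o"}
print(sorted(required - expand("amar")))    # no flag: all four are missing
print(sorted(required - expand("amar/Z")))  # with the toy flag: none missing
```

The same kind of check, run against the real unmunch output instead of this toy expander, would list exactly the forms the .aff rules still need to generate.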

marcoagpinto commented 7 months ago

Ahhhhh....

They are missing?

ama-te
ama-se
ame-se
ama-o

I will work on it in a few days.

Thanks for letting me know.

p-goulart commented 7 months ago

These forms work currently only incidentally, yes, because (for example) both ama and te exist as separate words. But if the speller dictionary's inflector doesn't output them, they won't be considered correctly spelt... which is an issue, since the new tokeniser will recognise ama-te as a single token.
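The effect can be shown with two toy tokenisers. This is a sketch under stated assumptions, not LanguageTool's actual tokeniser: `tokenise_split`, `tokenise_joined` and `unknown` are made-up helpers, and the lexicon contains only the two words from the example.

```python
import re

WORD = r"[^\W\d_]+"  # one or more Unicode letters


def tokenise_split(text):
    """Toy 'old' behaviour: a hyphen separates tokens."""
    return re.findall(WORD, text)


def tokenise_joined(text):
    """Toy 'new' behaviour: hyphenated verb + clitic stays one token."""
    return re.findall(rf"{WORD}(?:-{WORD})*", text)


def unknown(tokens, lexicon):
    """Return tokens that the speller lexicon does not contain."""
    return [t for t in tokens if t.lower() not in lexicon]


lexicon = {"ama", "te"}  # both parts exist as separate words

print(unknown(tokenise_split("ama-te"), lexicon))   # []
print(unknown(tokenise_joined("ama-te"), lexicon))  # ['ama-te']
```

With the split tokeniser nothing is flagged, but once `ama-te` is a single token it is accepted only if the speller dictionary itself generates that form.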

I will attach here a list of forms needed for a regular verb. (This doesn't include a bunch of irregular verbs that are simply not handled by the pt-PT .aff files.)

marcoagpinto commented 7 months ago

Thanks, that way I can focus on it better.

p-goulart commented 7 months ago

verb-test-out.csv

marcoagpinto commented 7 months ago

Ahhhhhhh

p-goulart commented 7 months ago

I can also attach here the files for other verbs. We'll need stuff like qui-lo, pu-lo, qué-lo, soubé-lo, etc.

marcoagpinto commented 7 months ago

Sure, I will add the rules bit by bit; I won't do them all at once.