Closed marcoagpinto closed 7 months ago
Is this all you need for pt-PT? Or should we be expecting more additions?
Heya, @p-goulart
For now, it is what I have changed in the .aff.
I will add words in the future to the .dic and I also must do that thing I said of comparing the wordlist of PT-PT with PT-BR.
This change adds around 1600 verb forms to pt-PT which will fix tons of words appearing as typos while writing text.
Sure, but the most important question is whether it outputs largely the same forms as those output by the PoS tagger. Adding forms with mo-l[oa]s? to one suffix flag may not add that much coverage.
I don't understand what you mean.
It adds the words I placed in the first comment: AO45: 3.PTPT_45_new_verbs.txt
AO90: 6.PTPT_90_new_verbs.txt
They would appear as typos, and now that should no longer happen.
Outputting new forms is good, but the important thing for the work we are doing now is making sure that pt-PT verb forms are the same as those output by the PoS tagger dictionary.
As we discussed in the past, we are changing our tagger to include enclitic pronouns as a part of the verb forms. The string ama-te will be a single verb form, ama-te, tagged V$some_tags:PP$some_tags.
The Hunspell .aff files must output the same forms. Otherwise there will be a discrepancy between the speller's verb forms and those of the tagger, which will cause inconsistencies in the tagging and spellchecking.
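A minimal sketch of the consistency check described above, assuming both the speller's expanded forms and the tagger dictionary's forms are available as plain lists of strings. The function name `compare_forms` and the sample data are hypothetical and only illustrate the idea; they are not taken from the actual scripts.

```python
def compare_forms(speller_forms, tagger_forms):
    """Return the forms that exist on only one side (hypothetical helper)."""
    speller = set(speller_forms)
    tagger = set(tagger_forms)
    return {
        # Forms the tagger knows but the speller would flag as typos.
        "missing_from_speller": sorted(tagger - speller),
        # Forms the speller accepts but the tagger cannot tag.
        "missing_from_tagger": sorted(speller - tagger),
    }


# Illustrative data only, not the real dictionaries.
speller_forms = ["ama", "amamos", "abolimo-la"]
tagger_forms = ["ama", "amamos", "ama-te", "abolimo-la"]

report = compare_forms(speller_forms, tagger_forms)
print(report)
```

Both lists in the report should be empty once the .aff output and the tagger dictionary agree.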
abolimo-la
abolimo-las
abolimo-lo
abolimo-los
I thought @susanaboatto was working on it?
For example: abolimo-la in the future should appear as: VMIP1P0X:PP3FSA00?
Yes, we are working on it, on the branch that I've just changed this PR to point to. Those two things must happen in parallel.
What shall I do then?
The words my patch adds are valid, but I don't know how to do a: VMIP1P0X:PP3FSA00 in the tags.
Susana is the right person to help with that.
In simple words,
abolimo-la
abolimo-las
abolimo-lo
abolimo-los
will no longer appear as typos, but they won't show: VMIP1P0X:PP3FSA00, etc.
@p-goulart Will you take care of this?
Right now, I can't focus on more flags for the .aff.
The words here are fine in and of themselves, I'm just pointing out that they are not all we need.
If I run a simple unmunch test on a basic pt-PT verb like amar/XYPL, I don't get a bunch of forms, e.g.:
ama-te
ama-se
ame-se
ama-o
etc.
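To make the idea of such a test concrete, here is a toy expander, not Hunspell's actual unmunch tool. The rules and the single flag `X` are hypothetical and do not come from the real pt-PT .aff file; each rule follows the usual suffix-rule shape of flag, characters to strip, characters to add, and a condition on the stem ending.

```python
import re

# Hypothetical suffix rules: (flag, strip, add, condition on the stem).
RULES = [
    ("X", "r", "", "ar$"),       # amar -> ama
    ("X", "r", "-te", "ar$"),    # amar -> ama-te
    ("X", "r", "-se", "ar$"),    # amar -> ama-se
    ("X", "ar", "e-se", "ar$"),  # amar -> ame-se
]


def expand(entry, rules=RULES):
    """Expand a 'stem/FLAGS' entry into the set of forms the rules produce."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    for flag, strip, add, cond in rules:
        if flag in flags and re.search(cond, stem) and stem.endswith(strip):
            forms.add(stem[: len(stem) - len(strip)] + add)
    return forms


forms = expand("amar/X")
expected = ["ama-te", "ama-se", "ame-se", "ama-o"]
missing = [form for form in expected if form not in forms]
print(missing)
```

In this toy setup no rule generates ama-o, so it is reported as missing, which is the kind of gap the test is meant to surface.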
You don't need to do anything with the PoS tags. The only thing that is required is for the pt-PT speller scripts to output the same forms as the PoS tagger scripts.
Ahhhhh....
They are missing?
ama-te
ama-se
ame-se
ama-o
I will work on it in a few days.
Thanks for letting me know.
These forms currently work only incidentally, yes, because (for example) both ama and te exist as separate words. But if the speller dictionary's inflector doesn't output them, they won't be considered correctly spelt... which is an issue, since the new tokeniser will recognise ama-te as a single token.
I will attach here a list of forms needed for a regular verb. (This doesn't include a bunch of irregular verbs that are simply not handled by the pt-PT .aff files.)
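A small sketch of why these forms pass today and would stop passing, under the assumption that the current behaviour amounts to checking each hyphen-separated part on its own. The lexicon and both function names are hypothetical.

```python
# Illustrative lexicon: the parts exist, the hyphenated form does not.
LEXICON = {"ama", "te", "se", "o"}


def accepts_split(token, lexicon=LEXICON):
    """Assumed current behaviour: each hyphen-separated part is looked up."""
    return all(part in lexicon for part in token.split("-"))


def accepts_whole(token, lexicon=LEXICON):
    """Behaviour once the tokeniser treats the form as a single token."""
    return token in lexicon


print(accepts_split("ama-te"), accepts_whole("ama-te"))
```

The whole-token lookup only succeeds if the inflector actually outputs ama-te, which is why the .aff rules need to generate it.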
Thanks, that way I can focus on it better.
Ahhhhhhh
I can also attach here the files for other verbs. We'll need stuff like qui-lo, pu-lo, qué-lo, soubé-lo, etc.
Sure, I will add the rules bit by bit, I won't do all at the same time.
Heya @susanaboatto
The AFF changes added around 1600 verbal forms to pt-PT:
AO45: 3.PTPT_45_new_verbs.txt
AO90: 6.PTPT_90_new_verbs.txt