meixome / hunspell-gl

GNU General Public License v3.0

Do not use slashes in morphological fields #256

Open PanderMusubi opened 6 years ago

PanderMusubi commented 6 years ago

I'm one of the developers of Nuspell, and we have come across an issue with the gl_ES.dic dictionary file. Of the 90 dictionaries in Nuspell/Hunspell/MySpell format that we use in regression testing, this is the only one that gives an error. Nuspell fails when it parses the slashes used in the morphological fields. No other dictionary uses slashes there, and for parsing purposes they are unwanted in that position.

The lines triggering this are lines with / but without any flags, such as:

    aínda po:adverbio / conxunción
    folio po:locución adxectiva / substantivo masculino [n-grama: in folio]

Here the parser treats everything before the slash as the word (aínda po:adverbio) and everything after it as the flags (conxunción). Unfortunately, the slash has a special meaning in .dic files and needs to be followed by one or more flags. Hence, the following is not a workaround:

    aínda/ po:adverbio / conxunción
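
To make the failure mode concrete, here is a minimal sketch of a parser that splits each line at the first slash. The function name `strict_split` is hypothetical and this is not Nuspell's actual code; it only illustrates why a slash inside the morphological fields gets mistaken for the flag separator.

```python
def strict_split(line):
    """Split a .dic line at the FIRST slash, wherever it occurs.

    Hypothetical illustration only, not Nuspell's real parser.
    Returns (word, flags); flags is "" when the line has no slash.
    """
    word, sep, rest = line.partition("/")
    if not sep:
        return line.strip(), ""
    return word.strip(), rest.strip()


# The slash in the morphological field is taken as the flag separator.
word, flags = strict_split("aínda po:adverbio / conxunción")
print(repr(word), repr(flags))
```

With this approach the "word" becomes `aínda po:adverbio` and the "flags" become `conxunción`, which is the misparse described above.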

Other lines with a slash in the morphological fields are:

    abafado/10,15 po:participio / adxectivo
    abafar/200,201,220,221,230,231 po:verbo ts:transitiva / intransitiva / pronominal VOLG: t i pr al:abáfar

There are several ways to solve this, for example:

    aínda po:adverbio + conxunción
    folio po:locución adxectiva + substantivo masculino [n-grama: in folio]
    abafado/10,15 po:participio + adxectivo
    abafar/200,201,220,221,230,231 po:verbo ts:transitiva + intransitiva + pronominal VOLG: t i pr al:abáfar

or

    aínda po:adverbio & conxunción
    folio po:locución adxectiva & substantivo masculino [n-grama: in folio]
    abafado/10,15 po:participio & adxectivo
    abafar/200,201,220,221,230,231 po:verbo ts:transitiva & intransitiva & pronominal VOLG: t i pr al:abáfar

or

    aínda po:adverbio
    aínda po:conxunción
    folio po:locución adxectiva [n-grama: in folio]
    folio po:substantivo masculino [n-grama: in folio]
    abafado/10,15 po:participio
    abafado/10,15 po:adxectivo
    abafar/200,201,220,221,230,231 po:verbo ts:transitiva VOLG: t i pr al:abáfar
    abafar/200,201,220,221,230,231 po:verbo ts:intransitiva VOLG: t i pr al:abáfar
    abafar/200,201,220,221,230,231 po:verbo ts:pronominal VOLG: t i pr al:abáfar

We would like to help with your choice, as we need to start deciding exactly how morphological fields with multiple POS tags will be processed.

Gallaecio commented 6 years ago

I am OK with replacing them with +.

Moreover, I think we should consider removing non-word parts from lines altogether, at least optionally, during the build step.
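
As a rough sketch of what such an optional build step could look like, the function below keeps only the word and its flags and drops everything after the first whitespace. The name `strip_morph` is hypothetical and not part of the existing build system, and the sketch assumes that entries never contain spaces inside the word itself.

```python
def strip_morph(line):
    """Keep only 'word' or 'word/flags', dropping the rest of the line.

    Hypothetical build-step helper. Assumes the word (with optional
    flags) is the first whitespace-delimited token of the line.
    """
    parts = line.split(None, 1)
    return parts[0] if parts else ""


print(strip_morph("abafado/10,15 po:participio / adxectivo"))
```

A line that holds a single token, such as the entry count at the top of a .dic file, passes through unchanged.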

However, if Nuspell is meant to be a drop-in replacement for Hunspell, shouldn’t your parser support this case? I believe the Hunspell parser supports our current syntax because of the space after the word: any slash after that space cannot be meant for flags.
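
The reasoning about the space can be sketched as follows: separate the first whitespace-delimited token first, and only look for a flag slash inside that token. This is an illustration of the argument, not Hunspell's actual implementation, and `lenient_split` is a hypothetical name. It assumes words never contain spaces.

```python
def lenient_split(line):
    """Split into (word, flags, morph), looking for '/' only in the
    first whitespace-delimited token.

    Hypothetical sketch; assumes single-word entries.
    """
    parts = line.split(None, 1)
    if not parts:
        return "", "", ""
    head = parts[0]
    morph = parts[1].strip() if len(parts) > 1 else ""
    # Only a slash inside the first token separates word from flags.
    word, _, flags = head.partition("/")
    return word, flags, morph


print(lenient_split("abafado/10,15 po:participio / adxectivo"))
print(lenient_split("aínda po:adverbio / conxunción"))
```

Under that assumption, slashes in the morphological part are never confused with the flag separator. The assumption stops holding once entries may contain spaces.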

If your parser has this extra requirement for a good reason, such as Nuspell allowing multi-word entries in .dic files, it would be great if you could write up a short document explaining the syntax differences between Hunspell and Nuspell .dic (and .aff?) files, so that we can adapt our build system in the future to build files optimized for each spellchecking engine.

PanderMusubi commented 6 years ago

I just discussed this with our team and read up on some more details. Using a + is only a temporary workaround. A much better solution is to use multiple morphological fields with the same key, e.g.:

    abafado/10,15 po:participio po:adxectivo
    abafar/200,201,220,221,230,231 po:verbo ts:transitiva ts:intransitiva ts:pronominal VOLG:...
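
A conversion along these lines could be scripted. The sketch below is one possible approach, not an existing tool: it splits the morphological part at " / " and prefixes each continuation with the most recent `key:` seen before the slash. The names `KEY_RE` and `repeat_keys` are hypothetical. It assumes a key is a run of word characters at the start of a token, immediately followed by a colon and a non-space character. Multi-word values (such as `locución adxectiva`) are carried over as they are, so they would still need separate attention.

```python
import re

# A key is assumed to be word characters at the start of a token,
# immediately followed by ':' and a non-space character.
KEY_RE = re.compile(r"(?:^|\s)(\w+):\S")


def repeat_keys(line):
    """Rewrite 'key:a / b' as 'key:a key:b' in the morphological part.

    Hypothetical sketch; the word/flags token is left untouched.
    """
    parts = line.split(None, 1)
    if len(parts) < 2:
        return line
    head, morph = parts
    segments = morph.split(" / ")
    out = [segments[0]]
    for seg in segments[1:]:
        keys = KEY_RE.findall(out[-1])
        if not keys:
            # No key to repeat: keep the original slash untouched.
            out[-1] = out[-1] + " / " + seg
            continue
        out.append(f"{keys[-1]}:{seg}")
    return head + " " + " ".join(out)


print(repeat_keys("abafado/10,15 po:participio / adxectivo"))
```

Running it over the whole .dic file and reviewing the diff by hand would be a sensible precaution, since the key detection is only a heuristic.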

Concerning the drop-in replacement: we have implemented a quicker but also stricter parser, and only came across this issue now. Only gl_ES uses space, slash, space. The Hunspell documentation actually describes the proposed usage of morphological fields (see e.g. fl:X fl:Y in https://linux.die.net/man/4/hunspell), but Hunspell did not complain about the slashes. (Early on, the documentation also states that one or more flags are expected after a slash.) So if you follow this example, you will also get better support in Hunspell for the fields you use. Does this answer your questions?

meixome commented 6 years ago

Could multiple morphological fields have an impact on performance?

PanderMusubi commented 6 years ago

I can't say for Hunspell; you will have to test that. For Nuspell, this is the only form that will be supported (as we see it now).

meixome commented 5 years ago

Is the solution to this what I broke? [translated from Galician]

PanderMusubi commented 5 years ago

Sorry, I don't speak this language. I had to run it through a translation website. Do you have a link to the result so I can review that?

Gallaecio commented 5 years ago

@meixome Yes, it is the solution to this that "you broke", but it was unavoidable: most pending pull requests had to break #264. I will eventually solve the issues and update the pull request accordingly.

PanderMusubi commented 5 years ago

Hi all, just a friendly reminder: any progress regarding these two issues and an upcoming release with a fix in it? Thanks.