jeppebundsgaard / stavekontrolden

A frontend for creating and maintaining hunspell dictionaries.
GNU General Public License v3.0

Stemming irregular nouns #4

Open mekanixdk opened 2 years ago

mekanixdk commented 2 years ago

We are using hunspell in elasticsearch to help us stem irregular nouns, but it doesn't really give us the expected result.

For example, "gulerod" (carrot) and "gulerødder" (carrots) are stemmed to "gulerod" (the word root) and "gulerødder" respectively, where we expected both to stem to "gulerod".

I have tried stemming the words using https://www.npmjs.com/package/nodehun as well with the same outcome, which leads me to think it is a hunspell/dictionary issue.

I have tried a couple of different da_DK and nb_NO dictionaries, e.g. from https://stavekontrolden.dk/?dictionaries=1, LibreOffice and Debian; the latter two ship various (older) versions of the first.

A little test-case

    const {Nodehun} = require('nodehun');
    const fs = require('fs');

    const affix = fs.readFileSync(
        `./elasticsearch/dictionaries/hunspell/yy_YY/yy_YY.aff`
    );
    const dictionary = fs.readFileSync(
        `./elasticsearch/dictionaries/hunspell/yy_YY/yy_YY.dic`
    );
    const nodehun = new Nodehun(affix, dictionary);

    const words = [
        'gulerod',
        'gulerødder',
        'mand',
        'mænd',
        'mønster',
        'mønstre'
    ];

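    // `await` is not allowed at the top level of a CommonJS module,
    // so the loop runs inside an async function.
    async function main() {
        for (const word of words) {
            const stems = await nodehun.stem(word);
            console.dir({word, stems});
        }
    }

    main().catch(console.error);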

which outputs

    { word: 'gulerod', stems: [ 'gulerod' ] }
    { word: 'gulerødder', stems: [ 'gulerødder' ] }
    { word: 'mand', stems: [ 'mand', 'mande' ] }
    { word: 'mænd', stems: [ 'mænd' ] }
    { word: 'mønster', stems: [ 'mønster' ] }
    { word: 'mønstre', stems: [ 'mønstre', 'mønster' ] }

As you can see, it handles mønster/mønstre correctly, but there the irregularity isn't in the vowels. Could that be the issue?

Now the question(s): Is this due to hunspell? Or the dictionary? And is there anything we can do to fix this?

lajo-gh commented 2 years ago

Hello

This is a consequence of how these words are entered into the dictionary (thus not a hunspell issue).

About two years ago, I realized that a number of irregular nouns can be considered a combination of a regular singular and a different regular plural. Take "gulerod": the singular forms are gulerod, gulerods, guleroden, gulerodens (carrot, carrot's, the carrot, the carrot's), which is a regular declension of the root "gulerod". The plural forms are gulerødder, gulerødders, gulerødderne, gulerøddernes (carrots, carrots', the carrots, the carrots'), which is a regular declension if you stick to the plural form "gulerødder". Thus, instead of having one noun with strong declension and 8 forms, I changed to a singular noun with regular declension and a plural noun with regular declension (a new class in the dictionary).

I see that this becomes a problem if you want to harvest the irregular nouns, and it also means that the dictionary cannot connect the singular and the plural form. In the version of Stavekontrolden from two years ago, however, strong declension did not connect with the root form, so that was not a loss at the time. Jeppe has since added the feature of giving strong forms a tag for the root form.

The good news, I hope, is that at the time I made a list of the words I changed in this way. It contains just the basic words, not compound words, e.g. "datter" (daughter), not "teenagedatter" (teenage daughter). The list is here: Ental og flertal for sig.txt

As my own observation, many of these irregular declensions relate to family members (mother, father, brother, daughter), body parts (hand, foot, toe) or some of the most common domestic animals. This is probably because these words have been used so often that they can retain a strong declension, despite our tendency to use the general, weak declensions.

You may also want to consider words like "laboratorium" (laboratory), where the "um" goes away in almost all forms. Is that regular or irregular? In the present dictionary, these words have their own declension, i.e. to the dictionary, it is regular. Take a look at the .aff file to identify the two classes that define these words: one for um/us, and one for um with -e as a possibility ("laboratorie" has in recent years become an accepted form). This should make it easy for you to find these words, if they are of interest to you.
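A minimal sketch of that lookup, assuming the dictionary uses hunspell's default single-character flags and that the class flag has already been read off the .aff file. The flag `X` and the sample entries are hypothetical placeholders, not taken from the actual da_DK dictionary.

```javascript
// Return all words in a hunspell .dic file that carry the given affix flag.
// Assumes single-character flags (hunspell's default FLAG mode); a dictionary
// that declares FLAG long or FLAG num in its .aff file needs different splitting.
function wordsWithFlag(dicText, flag) {
    const lines = dicText.split(/\r?\n/);
    const matches = [];
    // The first line of a .dic file is the approximate entry count; skip it.
    for (const line of lines.slice(1)) {
        // Drop morphological fields that follow the entry after whitespace.
        const entry = line.trim().split(/\s+/)[0];
        if (!entry) continue;
        const slash = entry.indexOf('/');
        if (slash === -1) continue; // no flags on this entry
        const word = entry.slice(0, slash);
        const flags = entry.slice(slash + 1);
        if (flags.includes(flag)) matches.push(word);
    }
    return matches;
}

// Hypothetical sample: 'X' stands in for the real um/us class flag.
const sample = [
    '3',
    'laboratorium/X',
    'gulerod/AB',
    'museum/XC st:museum',
].join('\n');

console.log(wordsWithFlag(sample, 'X'));
```

Against the real dictionary, the text would come from something like `fs.readFileSync(path, 'utf8')`, with the flag replaced by whatever the .aff file actually uses for these classes.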

From your code, I think "mønster" represents a misunderstanding. Plural is "mønstre" (one of the possible regular declensions). However "mønstre" can also be a verb as in the sentence "officeren mønstrede rekrutterne" (the officer inspected the recruits), which is why "mønstre" gives you both "mønster" (a noun) and "mønstre" (a verb) as root.

If you want to experiment with the dictionary from before I worked on it, you can use version 2.4 from https://extensions.libreoffice.org/en/extensions/show/stavekontrolden-danish-dictionary. However, in that version of the dictionary, there is no specification of roots for the strong declensions, and in a number of cases, a noun is given as strong declension simply because it can be spelled in more than one way, which was originally coded as giving all possible spellings as strong declension. In the current version, different spelling is given as different words.

I hope this helps.

Best regards, Lars

lajo-gh commented 2 years ago

An additional thought. It struck me that most of these irregularities exhibit a predictable vowel change (an observation that might be common knowledge for linguists, but was new to me). Writing only the Danish version of the nouns, we have a series where "o" (or, in barn and datter, "a") in the singular becomes "ø" in the plural:

barn/børn, bog/bøger, bonde/bønder, broder/brødre, datter/døtre, fod/fødder, klo/kløer, ko/køer, moder/mødre, rod/rødder, so/søer

Another set has "a" or "å" becoming "æ" in the plural:

and/ænder, fader/fædre, gås/gæs, hånd/hænder, kraft/kræfter, mand/mænd, nat/nætter, rå/ræer, stad/stæder, stang/stænger, tand/tænder, tang/tænger, tå/tæer

The plural endings differ, but these differences fall within the normal weak declensions. Of the natively Danish words in my list, that leaves only høne/høns and øje/øjne.

It seems difficult to make hunspell rules for these cases, as the vowels to be changed are not in a fixed position. But interesting (for someone like me, who has not thought about it before), and might be of relevance to your project?
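Outside hunspell, the pattern is easy to check mechanically. The following is only a sketch: it compares the two forms character by character and reports the first vowel that was swapped. The function names and the `'other'` bucket are made up for illustration, and the pairs are a subset of those listed above.

```javascript
const VOWELS = 'aeiouyæøå';

// Return the first vowel substitution between the two forms as "a→æ",
// or null when the forms first differ in something other than a vowel swap.
function vowelChange(singular, plural) {
    // NFC keeps "å" as a single code unit so index comparison works.
    const s = singular.normalize('NFC');
    const p = plural.normalize('NFC');
    const n = Math.min(s.length, p.length);
    for (let i = 0; i < n; i++) {
        if (s[i] === p[i]) continue;
        return VOWELS.includes(s[i]) && VOWELS.includes(p[i])
            ? `${s[i]}→${p[i]}`
            : null;
    }
    return null;
}

// Group singular/plural pairs by their vowel change.
function groupByVowelChange(pairs) {
    const groups = new Map();
    for (const [singular, plural] of pairs) {
        const key = vowelChange(singular, plural) ?? 'other';
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(`${singular}/${plural}`);
    }
    return groups;
}

const pairs = [
    ['bog', 'bøger'], ['fod', 'fødder'], ['barn', 'børn'],
    ['mand', 'mænd'], ['gås', 'gæs'], ['tå', 'tæer'],
    ['høne', 'høns'], ['øje', 'øjne'],
];

console.log(groupByVowelChange(pairs));
```

With these pairs, høne/høns and øje/øjne end up in the `'other'` bucket, because their first difference is not a vowel swap.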

Best regards, Lars

mekanixdk commented 2 years ago

Thank you for your explanation. Knowing that this is a limitation/design choice of the dictionary helps us not chase this any further.

As the list of irregular nouns is limited, we will probably opt for an approach where we map the irregular plural forms to their singular root when stemming.
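A rough sketch of what such a mapping could look like on top of the test case above. The table, `stemWithOverrides` and `fakeStem` are hypothetical names for illustration. The table holds only a few pairs mentioned in this thread, and `fakeStem` stands in for `nodehun.stem()` so the example runs without dictionary files; its return values are assumed from the behaviour described above, not measured.

```javascript
// Hypothetical override table: irregular plural stem -> singular root.
// A real table would be built from the full list of changed words.
const IRREGULAR_PLURALS = new Map([
    ['gulerødder', 'gulerod'],
    ['mænd', 'mand'],
    ['børn', 'barn'],
    ['fødder', 'fod'],
]);

// Stem a word with the supplied async stemmer, then map any irregular
// plural stems to their singular root and drop duplicates.
async function stemWithOverrides(stemFn, word) {
    const stems = await stemFn(word);
    const mapped = stems.map((stem) => IRREGULAR_PLURALS.get(stem) ?? stem);
    return [...new Set(mapped)];
}

// Stand-in for nodehun.stem(); the results here are assumptions.
async function fakeStem(word) {
    const table = {
        'gulerødderne': ['gulerødder'],
        'mænd': ['mænd'],
        'mønstre': ['mønstre', 'mønster'],
    };
    return table[word] ?? [word];
}

stemWithOverrides(fakeStem, 'gulerødderne').then(console.log);
```

With the real library, the call would be along the lines of `stemWithOverrides((w) => nodehun.stem(w), word)`. Applying the table after hunspell means that inflected plural forms, which hunspell reduces to the plural stem, are covered by a single entry. On the Elasticsearch side, a similar table could probably be expressed as rules for the `stemmer_override` token filter, but the exact placement in the analysis chain should be checked against the Elasticsearch documentation.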