snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Norwegian stemming of words that end with "ers" #175

Open karianne opened 1 year ago

karianne commented 1 year ago

In Norwegian, we have a number of words where the singular, indefinite form of the noun ends with "ers", for instance "kontrovers" ([a] controversy), "univers" ([a] universe) and "ters" ([a] third - in musical terminology). Other forms of these words are "kontroversen/kontroverser/kontroversene/kontroversens", "universet/universer/universene/universets" and "tersen/terser/tersene/tersens". Right now, these are not being stemmed correctly. The base forms turn into "kontrov", "univ" and "ter", respectively, while the other forms turn into "kontrovers", "univers" and "ters", so the forms of a single word do not share a stem.

Would the best way to solve this be to add exceptions to the stemmer?

ojwb commented 1 year ago

When you say "a number of words", roughly how many cases are there?

The rule to remove ers doesn't fire for ters like it does for kontrovers and univers essentially because ters is too short (in detail because the ers suffix isn't entirely in R1 - here | marks the start of R1: ter|s). Then the suffix s gets removed instead (because it's the longest of the suffixes entirely in R1), which isn't actually what we want here. Knowing if ters is the only word affected would be useful to decide what's best to do.
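To make the R1 reasoning concrete, here is a small Python sketch of how R1 is located for these words. It is a simplified model rather than the Snowball source: the names `r1_start` and `suffix_in_r1` are made up for illustration, the vowel set mirrors `define v` in norwegian.sbl, and it assumes the usual Scandinavian adjustment that at least 3 letters precede R1.

```python
# Simplified model of the Norwegian stemmer's R1 region (not the Snowball source).
VOWELS = set("aeiouyæåø")  # mirrors `define v` in norwegian.sbl


def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel that follows a
    vowel, moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)  # no such position: R1 is empty


def suffix_in_r1(word: str, suffix: str) -> bool:
    """True if `word` ends with `suffix` and the suffix lies entirely in R1."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)


for w in ("ters", "kontrovers", "univers"):
    p = r1_start(w)
    # Show the R1 boundary, then whether 'ers' and 's' fit inside R1.
    print(f"{w[:p]}|{w[p:]}", suffix_in_r1(w, "ers"), suffix_in_r1(w, "s"))
# ter|s False True
# kon|trovers True True
# uni|vers True True
```

Only `s` fits inside R1 for ters, which is why the `s` rule fires there instead of the `ers` rule.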

For the others, adding suffixes to reduce them suitably seems to work:

diff --git a/algorithms/norwegian.sbl b/algorithms/norwegian.sbl
index 39f4aff0..26bd3e1f 100644
--- a/algorithms/norwegian.sbl
+++ b/algorithms/norwegian.sbl
@@ -40,7 +40,7 @@ backwardmode (

             'a' 'e' 'ede' 'ande' 'ende' 'ane' 'ene' 'hetene' 'en' 'heten' 'ar'
             'er' 'heter' 'as' 'es' 'edes' 'endes' 'enes' 'hetenes' 'ens'
-            'hetens' 'ers' 'ets' 'et' 'het' 'ast'
+            'hetens' 'ers' 'ersen' 'erser' 'erset' 'ersets' 'ersene' 'ersens' 'ets' 'et' 'het' 'ast'
                 (delete)
             's'
                 (s_ending or ('k' non-v) delete)

That fixes all cases for kontrovers and univers at least, and it doesn't affect the stemming of any words in the test vocabulary of ~20000 words we currently have, which is a good sign.
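The effect of the longer suffix list can be checked with a rough Python model of the delete step. This is only a sketch under stated assumptions: `strip_longest` and `r1_start` are illustrative names, the suffix lists are copied from the diff context above, and the `s` and `erte`/`ert` rules are left out because they do not apply to these forms.

```python
# Rough model of "delete the longest listed suffix that lies entirely in R1".
VOWELS = set("aeiouyæåø")

# Suffixes deleted outright, copied from the diff context above.
OLD = ("a e ede ande ende ane ene hetene en heten ar er heter as es edes "
       "endes enes hetenes ens hetens ers ets et het ast").split()
NEW = OLD + "ersen erser erset ersets ersene ersens".split()


def r1_start(word: str) -> int:
    """Start of R1, with at least 3 letters before it (simplified)."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)


def strip_longest(word: str, suffixes: list[str]) -> str:
    """Delete the longest listed suffix lying entirely in R1, if any."""
    p1 = r1_start(word)
    hits = [s for s in suffixes
            if word.endswith(s) and len(word) - len(s) >= p1]
    if not hits:
        return word
    return word[:len(word) - len(max(hits, key=len))]


FORMS = {
    "kontrovers": ["kontrovers", "kontroversen", "kontroverser",
                   "kontroversene", "kontroversens"],
    "univers": ["univers", "universet", "universer",
                "universene", "universets"],
}

for lemma, forms in FORMS.items():
    old = sorted({strip_longest(f, OLD) for f in forms})
    new = sorted({strip_longest(f, NEW) for f in forms})
    print(lemma, "old:", old, "new:", new)
# kontrovers old: ['kontrov', 'kontrovers'] new: ['kontrov']
# univers old: ['univ', 'univers'] new: ['univ']
```

With the original list each word ends up with two different stems; with the extended list all forms collapse to one.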

The alternative approach would be to somehow prevent ers being removed from kontrovers and univers - I'm guessing that would give the more linguistically correct stem, though for Snowball's intended uses that isn't important so long as the stemmed forms don't collide with those of unrelated words. I notice both the affected cases end -vers but looking at the test vocabulary we have andelshaver and andelshavers which both currently stem to andelshav and a rule like "remove ers unless preceded by v" would break that.

I think I've spotted a case that ers removal currently incorrectly conflates though - revers (reverse of a coin or reverse gear, apparently) gets stemmed to rev but that seems to mean a "fox" or a "reef". It looks to me (with my very scant knowledge of Norwegian) like revers/reversen/reverser/reversene is another example of what you're reporting, but one that would be better fixed by not removing ers from revers rather than adding the extra suffixes to the list to remove.

The words ending vers in our test vocabulary are:

ojwb commented 1 year ago

arbeidsgivers (this doesn't seem to be a valid word, but arbeidsgiver is - typo maybe?)

Oh, this is probably a genitive -s suffix, so would translate to English as "employer's", so it ideally would stem to the same thing as arbeidsgiver (currently both stem to arbeidsgiv).

I'm struggling to see a rule (or set of rules) for handling words ending -vers which works for all the cases above.

I had a look for a more comprehensive Norwegian word list but haven't found one yet. If you know of a source that'd be helpful. Otherwise maybe I should try generating one from wikipedia data (we have a script to automate that).

karianne commented 1 year ago

Thank you for your feedback!

The case of "revers" is actually harder, as it can both mean "reverse of a coin or reverse gear", as you say, but also is a plural possessive form of "rev" (fox). The English translation is "foxes'", as in "Several foxes' fur were matted". "Andelshavers" and "arbeidsgivers" should probably not be changed, since they are both singular possessive forms of nouns that end with "er" in singular form. "tryllevers", which is indeed a magic verse, has the same problem.

I have compiled a list of the words I think are relevant, according to naob.no (Norwegian online dictionary). It's not a long list, but some of these words can be combined with others, as Norwegian has the concept of compound words (sammensatte ord), which gives us for instance "tryllevers", "sangvers" (song verse), "salmevers" (psalm verse), "barnevers" (child verse or child poem), "bibelvers" (bible verse), "bordvers" (saying grace before eating food), "matvers" (the same as the previous one) and probably others, too. Here is the list of nouns:

There are more words, but all of these are twins of other words. For instance, "kammers" can be a singular form meaning "small room", but also a plural possessive form of "kam" (comb). I don't think there can be a common rule for all of these words, at least not one that I can think of. That's why I asked if adding exceptions to the stemmer might be the cleanest solution here.

ojwb commented 1 year ago

The case of "revers" is actually harder, as it can both mean "reverse of a coin or reverse gear", as you say, but also is a plural possessive form of "rev" (fox). The English translation is "foxes'", as in "Several foxes' fur were matted".

OK, so this one is a genuinely ambiguous case. Probably the first meaning is going to occur more commonly than the second, but neither is a particularly common word and which is more likely will depend somewhat on the nature of the data. Interpreting it as the second as we currently do isn't unreasonable.

Thanks for the list - that's really helpful. I'll study it and see if I can come up with a plan.

ojwb commented 1 year ago

These seem to fall into two sets.

One is the short words where we don't remove ers because it's not entirely in R1, and then remove s instead because it's the longest suffix which is entirely in R1. It looks to me like that can be dealt with by adjusting s removal so that it isn't done when the s is preceded by er:

diff --git a/algorithms/norwegian.sbl b/algorithms/norwegian.sbl
index 39f4aff0..cc29c703 100644
--- a/algorithms/norwegian.sbl
+++ b/algorithms/norwegian.sbl
@@ -21,7 +21,7 @@ stringdef o/   '{U+00F8}'

 define v 'aeiouy{ae}{ao}{o/}'

-define s_ending  'bcdfghjlmnoprtvyz'
+define s_ending  'bcdfghjlmnoptvyz'

 define mark_regions as (

@@ -43,7 +43,7 @@ backwardmode (
             'hetens' 'ers' 'ets' 'et' 'het' 'ast'
                 (delete)
             's'
-                (s_ending or ('k' non-v) delete)
+                (s_ending or ('r' not 'e') or ('k' non-v) delete)
             'erte' 'ert'
                 (<-'er')
         )

('r' not 'e' checks for r not preceded by e because we're in backwardmode here.)

This fixes avers, mers, overs, pers, ters and vers from your list.

It changes two words in the existing voc.txt: tvers (adverb) and ymers (proper noun?) no longer have s removed, both of which seem OK changes. Because ers is removed in preference whenever it is entirely in R1, this change can only affect words of the form <zero or more vowels><one or more consonants><zero or more vowels>ers.
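For anyone wanting to sanity-check the revised condition, here is a minimal Python sketch of the `s` rule before and after the patch. It is a simplified model and not the Snowball source: `remove_s` and `r1_start` are illustrative names, it applies only the `s` rule, and it assumes no longer suffix matched in R1 (true for the words below). "doktors" is an extra word picked here to show that r not preceded by e still allows removal.

```python
# Simplified model of the 's' removal condition, old versus patched.
VOWELS = set("aeiouyæåø")
OLD_S_ENDING = set("bcdfghjlmnoprtvyz")
NEW_S_ENDING = set("bcdfghjlmnoptvyz")  # 'r' dropped by the patch


def r1_start(word: str) -> int:
    """Start of R1, with at least 3 letters before it (simplified)."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)


def remove_s(word: str, patched: bool) -> str:
    """Apply only the 's' rule; assumes no longer suffix matched in R1."""
    if not word.endswith("s") or len(word) - 1 < r1_start(word):
        return word  # the 's' is not entirely in R1
    stem = word[:-1]
    prev = stem[-1]
    before = stem[-2] if len(stem) > 1 else ""
    # ('k' non-v): k preceded by a letter that is not a vowel.
    k_rule = prev == "k" and before != "" and before not in VOWELS
    if patched:
        # ('r' not 'e'): r that is not preceded by e.
        ok = prev in NEW_S_ENDING or (prev == "r" and before != "e") or k_rule
    else:
        ok = prev in OLD_S_ENDING or k_rule
    return stem if ok else word


WORDS = ["avers", "mers", "overs", "pers", "ters", "vers",
         "tvers", "ymers", "doktors"]
for w in WORDS:
    print(w, "old:", remove_s(w, False), "new:", remove_s(w, True))
```

In this model the words ending in ers keep their s after the patch, while "doktors" still reduces to "doktor" either way.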

The other case is where we're removing ers and making a different stem to other forms of the word. I'm still looking at that.

karianne commented 1 year ago

That looks like a good change! Ymers is indeed the possessive form of the name Ymer (the original Jotun/giant being in Norse mythology), so while it's not ideal that we stem Ymer differently from Ymers, it's probably not a showstopper.