snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

German stemmer possible improvements #161

Open OlgaGuselnikova opened 2 years ago

OlgaGuselnikova commented 2 years ago

Hello, Snowball developers team!

I work on translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

  1. Remove ending -ers

Example (word - stem by Snowball demo - stem by customized algorithm):

* Förderer - ford - ford
* Förderers - forder - ford
* Förderern - ford - ford

  2. Feminine nouns

-erinnen is replaced with -erin

There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered German word should be translated into another language. Our addition to the algorithm simply stems plural and singular feminine nouns in the same manner.

Example (word - stem by Snowball demo - stem by customized algorithm):

* Politikerin - politikerin - politikerin
* Politikerinnen - politikerinn - politikerin

  3. Remove -stern

Example (word - stem by Snowball demo - stem by customized algorithm):

* morgenstern - morgen - morgen
* morgensterne - morgenstern - morgen

  4. Remove ending -em

That change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

Example (word - stem by Snowball demo - stem by customized algorithm):

* system - syst - syst
* systems - system - syst

  5. -ln replaced with -l

Example (word - stem by Snowball demo - stem by customized algorithm):

* artikel - artikel - artikel
* artikeln - artikeln - artikel

We have implemented those changes (including updating the word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.

Standard suffix algorithm with the changes described above:

```
define standard_suffix as (
    do (
        [substring] R1 among(
            'ers'
            (   delete
            )
        )
    )
    do (
        [substring] R1 among(
            'erinnen'
            (   <- 'erin'
            )
            'em' 'ern' 'er'
            (   delete
            )
            'e' 'en' 'es'
            (   delete
                try (['s'] 'nis' delete)
            )
            's'
            (   s_ending delete
            )
        )
    )
    do (
        [substring] R1 among(
            'stern'
            (   delete
            )
            'en' 'er' 'est' 'em'
            (   delete
            )
            'st'
            (   st_ending hop 3 delete
            )
        )
    )
    do (
        [substring] R2 among(
            'end' 'ung'
            (   delete
                try (['ig'] not 'e' R2 delete)
            )
            'ig' 'ik' 'isch'
            (   not 'e' delete
            )
            'lich' 'heit'
            (   delete
                try (
                    ['er' or 'en'] R1 delete
                )
            )
            'keit'
            (   delete
                try (
                    [substring] R2 among(
                        'lich' 'ig'
                        (   delete
                        )
                    )
                )
            )
        )
    )
    do (
        [substring] R1 among(
            'ln'
            (   <- 'l'
            )
        )
    )
)
```

Thank you for your time!

ojwb commented 1 year ago

Thanks for submitting this and sorry for taking an age to get to it.

Some thoughts:

Remove ending -ers

I think I need to look into this one more.

-erinnen is replaced with -erin

This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

Remove -stern

Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).

Remove ending -em

If system -> syst is the problematic case, maybe it would be better to prevent that happening instead? It's not conflating with another word like morgenstern, but I think it's good to consider if there's a better way to address this.

I notice this appears to be due to -st before -em and the -stern case to be -st before -ern, but a simple restriction to only remove -em and -ern if not preceded by -st seems to affect cases we probably don't want to change. Are there any other

-ln replaced with -l

This seems good too.

ojwb commented 1 year ago

> -erinnen is replaced with -erin
>
> This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

#85 was particularly motivated by job listings and wanted to conflate e.g. "Verkäufer" and "Verkäuferin" by stemming them to the same stem. That makes sense there but maybe in other contexts that's less helpful, and understemming is the safer option when unsure.

On the "for #85" side, "Verkäuferin" and "Verkäufer" are arguably closer in meaning than "Verkäufer" is to other words we also currently stem to "verkauf" such as "verkaufen". Also the modern trend seems to be away from gendered language (an example from English is that "actor" tends to be used regardless of gender and "actress" gets used less) and assuming similar trends in German (which from #153 I gather is the case) that also tends to argue for the change from #85.

I'm unsure what's best here, but having set down the points above I'm leaning towards the change #85.
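
To make the two options concrete, here is a minimal Python sketch (not the Snowball source) of step 1 only, under three variants: the behaviour before any change, -erinnen -> -erin as proposed in this issue, and removing both -erin and -erinnen as in #85. The function names and the `variant` parameter are assumptions made for illustration. The prelude, the 'nis' special case and steps 2 and 3 are omitted, so the strings returned are intermediate results, not final stems.

```python
VOWELS = set("aeiouyäöü")
S_ENDING = set("bdfghklmnrt")  # letters that may precede a removable -s


def r1_start(word):
    """Index where R1 begins: after the first non-vowel following a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return max(i + 1, 3)
    return len(word)


def step1(word, variant="current"):
    """Simplified step 1. variant: 'current', 'plural_only' or 'drop_both'."""
    suffixes = ["em", "ern", "er", "e", "en", "es", "s"]
    if variant in ("plural_only", "drop_both"):
        suffixes.append("erinnen")
    if variant == "drop_both":
        suffixes.append("erin")
    # Snowball's `substring` picks the longest matching suffix first ...
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    # ... and only then is the R1 condition checked.
    if len(word) - len(suffix) < r1_start(word):
        return word
    stem = word[: -len(suffix)]
    if suffix == "erinnen" and variant == "plural_only":
        return stem + "erin"
    if suffix == "s" and (not stem or stem[-1] not in S_ENDING):
        return word
    return stem


for w in ("politiker", "politikerin", "politikerinnen"):
    print(w, [step1(w, v) for v in ("current", "plural_only", "drop_both")])
```

In this sketch only the `drop_both` variant brings "politiker", "politikerin" and "politikerinnen" to the same string after step 1, which is the conflation #85 asked for; `plural_only` conflates just the two feminine forms.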

ojwb commented 1 year ago

The -ers removal change only changes the stems for two words in the current german/voc.txt:

So I think this needs more investigation - if we can find more examples ending -ers and look at which are better and which worse, maybe we can come up with a rule like this but with a condition (e.g. "remove '-ers' unless preceded by a vowel" would work for the 3 examples we currently have) or in a different place in the algorithm.

ojwb commented 1 year ago

Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index cd303b15..7dfa9c62 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -84,7 +84,10 @@ backwardmode (
     define standard_suffix as (
         do (
             [substring] R1 among(
-                'em' 'ern' 'er'
+                'em'
+                (   not 'syst' delete
+                )
+                'ern' 'er'
                 (   delete
                 )
                 'e' 'en' 'es'

@OlgaGuselnikova That addresses all the "system" cases and should avoid the overstemming you mention. Were there any other cases this approach doesn't address?
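
As a rough illustration of what the `not 'syst'` guard does, here is a Python sketch of step 1 alone, with and without the guard. It is a simulation written for this discussion, not the generated stemmer: `step1` and the `guard_system` flag are hypothetical names, and the prelude, the 'nis' special case and the later steps are left out.

```python
VOWELS = set("aeiouyäöü")
S_ENDING = set("bdfghklmnrt")  # letters that may precede a removable -s


def r1_start(word):
    """Index where R1 begins: after the first non-vowel following a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return max(i + 1, 3)
    return len(word)


def step1(word, guard_system=False):
    """Simplified step 1; guard_system mimics `not 'syst'` before deleting -em."""
    suffixes = ["em", "ern", "er", "e", "en", "es", "s"]
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # longest match wins
    if len(word) - len(suffix) < r1_start(word):
        return word  # suffix is not inside R1
    stem = word[: -len(suffix)]
    if suffix == "em" and guard_system and stem.endswith("syst"):
        return word  # keep -em on words ending -system
    if suffix == "s" and (not stem or stem[-1] not in S_ENDING):
        return word
    return stem


for w in ("system", "systems"):
    print(w, step1(w), step1(w, guard_system=True))
# system  -> syst   without the guard, system with it
# systems -> system in both cases
```

With the guard, both forms end step 1 as "system", so they conflate without touching -em removal for other words.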

ojwb commented 1 year ago

> Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

I've gone ahead and merged this change.

ojwb commented 1 year ago

> -ln replaced with -l
>
> This seems good too.

Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.

ojwb commented 1 month ago

> -ln replaced with -l
>
> Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.

If we handle this removal in step 1 then we avoid conflating Keller and Kellner without making anything else worse (at least in the sample vocabulary list snowball-data/german.voc.txt). This is the change I tested (removing -lns too means we still conflate rasseln and rasselns but doesn't affect anything else in the sample vocabulary):

--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -98,6 +98,9 @@ backwardmode (
                 's'
                 (   s_ending delete
                 )
+                'ln' 'lns'
+                (   <- 'l'
+                )
             )
         )
         do (

I'm running a script to collate a larger (and perhaps more modern) German wordlist from a de.wikipedia.org dump - then we can see how this looks for a more comprehensive vocabulary list (and see if -lns is worthwhile - if it really affects just a single word, it probably is not).
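
The effect of where the -ln rule sits can be reproduced with a small Python sketch. It compares a rule applied after the other steps (as in the original proposal) with the same rule placed inside step 1 (the diff above). The names `stem_ln_last` and `stem_ln_in_step1` are made up for this illustration, and steps 2 and 3, the prelude and the 'nis' special case are omitted because they do not change these particular words.

```python
VOWELS = set("aeiouyäöü")
S_ENDING = set("bdfghklmnrt")  # letters that may precede a removable -s


def r1_start(word):
    """Index where R1 begins: after the first non-vowel following a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return max(i + 1, 3)
    return len(word)


def step1(word, r1, ln_in_step1):
    suffixes = ["em", "ern", "er", "e", "en", "es", "s"]
    if ln_in_step1:
        suffixes += ["ln", "lns"]
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)  # longest match wins, e.g. 'lns' over 's'
    if len(word) - len(suffix) < r1:
        return word
    stem = word[: -len(suffix)]
    if suffix in ("ln", "lns"):
        return stem + "l"
    if suffix == "s" and (not stem or stem[-1] not in S_ENDING):
        return word
    return stem


def stem_ln_last(word):
    """-ln -> -l applied as a separate final step."""
    r1 = r1_start(word)  # regions are fixed once, on the original word
    word = step1(word, r1, ln_in_step1=False)
    if word.endswith("ln") and len(word) - 2 >= r1:
        word = word[:-1]
    return word


def stem_ln_in_step1(word):
    """-ln / -lns -> -l competing with the other step 1 suffixes."""
    return step1(word, r1_start(word), ln_in_step1=True)


for w in ("keller", "kellner", "artikeln", "rasseln", "rasselns"):
    print(w, stem_ln_last(w), stem_ln_in_step1(w))
```

In the final-step variant "kellner" first loses -er, leaving "kelln", which the later rule then turns into "kell", the same string as for "keller". Inside step 1 only one suffix is removed per word, so "kellner" stops at "kelln" while "artikeln" still becomes "artikel".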

ojwb commented 1 month ago

Testing on a larger list, there are more cases where, if we do -ln -> -l, also doing -lns -> -l is useful. These are mostly nouns which happen to end -ln. That means we now generate a stem which isn't linguistically correct for these, but that's not a problem in the intended domain of use. The only concern is whether it introduces unwanted conflation, and in practice it doesn't seem to. Pushing changes to implement that.

ojwb commented 1 month ago

> Remove -stern
>
> Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).

I had another look at this, and tried

diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index c0973c72..6de281f8 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -88,7 +88,11 @@ backwardmode (
                 (   not 'syst' // don't remove -em from words ending -system
                     delete
                 )
-                'ern' 'er'
+                'ern'
+                (   not ('st' R1) // don't remove -stern from morgenstern, etc
+                    delete
+                )
+                'er'
                 'erin' 'erinnen' // conflate female versions of nouns
                 (   delete
                 )

This gives:

A total of 21 words changed stem
* 10 words changed stem but aren't interesting
* 1 merges of groups of stems:
  { morgenstern } + { morgensterne }
* 9 splits of groups of stems:
  { abend abende abendlichen abends | abendstern }
  { eisgeschwister | eisgeschwistern }
  { finster finstere finsteren finsteres | finstern }
  { geschwister | geschwistern }
  { höllengeister | höllengeistern }
  { leit leite leiten leiter leiterin leiters leitest | leitstern }
  { naturgeister | naturgeistern }
  { philister | philistern }
  { vorg | vorgestern }

Of those, morgenstern, abendstern and leitstern are all something-star cases which seem minor improvements; vorg doesn't seem to be a word; the other splits seem unhelpful. Also fenstern is now conflated with fensternische instead of with fenster+fensters, which seems slightly worse but reasonable.
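
For readers who want to produce a similar merge/split summary for their own word list, the following Python sketch shows one way to do it. It is not the script that generated the report above: `stem_groups` and `splits_and_merges` are hypothetical helpers, and the two stem tables are small hand-written stand-ins based on the cases listed in this thread.

```python
from collections import defaultdict


def stem_groups(words, stem):
    """Partition words into groups that share a stem."""
    by_stem = defaultdict(set)
    for w in words:
        by_stem[stem(w)].add(w)
    return {frozenset(g) for g in by_stem.values()}


def splits_and_merges(words, old_stem, new_stem):
    """Return (splits, merges) between two stemmers over the same words."""
    old = stem_groups(words, old_stem)
    new = stem_groups(words, new_stem)

    def regrouped(src, dst):
        found = []
        for group in src - dst:
            # dst groups that lie entirely inside this src group
            parts = {p for p in dst if p <= group}
            if len(parts) > 1 and frozenset().union(*parts) == group:
                found.append(sorted(sorted(p) for p in parts))
        return sorted(found)

    # A split is an old group divided into several new groups;
    # a merge is a new group assembled from several old groups.
    return regrouped(old, new), regrouped(new, old)


# Hand-written stems standing in for the stemmer before and after the change.
OLD = {"abend": "abend", "abends": "abend", "abendstern": "abend",
       "morgenstern": "morgen", "morgensterne": "morgenstern"}
NEW = {"abend": "abend", "abends": "abend", "abendstern": "abendstern",
       "morgenstern": "morgenstern", "morgensterne": "morgenstern"}

splits, merges = splits_and_merges(list(OLD), OLD.get, NEW.get)
print("splits:", splits)  # [[['abend', 'abends'], ['abendstern']]]
print("merges:", merges)  # [[['morgenstern'], ['morgensterne']]]
```

Comparing partitions rather than raw stems keeps the report focused on changes in conflation, which is what matters for retrieval; a stem that merely changes spelling without regrouping any words shows up in neither list.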

Overall this doesn't seem a worthwhile change. (I tried the extra R1 check to try to reduce unwanted changes but it's not really helping - without it there were about twice as many changes, but not worthwhile overall either.)
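
The behaviour of the `not ('st' R1)` guard on a few of the words above can be traced with this Python sketch of steps 1 and 2. It is a simplified simulation under stated assumptions, not the compiled algorithm: the `guard_stern` flag is a hypothetical name, and the prelude, the 'nis' special case, the other step 1 entries (-erin, -erinnen, -ln, the -system guard) and step 3 are left out.

```python
VOWELS = set("aeiouyäöü")
S_ENDING = set("bdfghklmnrt")  # letters that may precede a removable -s
ST_ENDING = set("bdfghklmnt")  # letters that may precede a removable -st


def r1_start(word):
    """Index where R1 begins: after the first non-vowel following a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return max(i + 1, 3)
    return len(word)


def longest_suffix(word, suffixes):
    matches = [s for s in suffixes if word.endswith(s)]
    return max(matches, key=len) if matches else None


def step1(word, r1, guard_stern):
    suffix = longest_suffix(word, ["em", "ern", "er", "e", "en", "es", "s"])
    if suffix is None or len(word) - len(suffix) < r1:
        return word
    stem = word[: -len(suffix)]
    # Mirrors `not ('st' R1)`: keep -ern when "st" precedes it inside R1.
    if (suffix == "ern" and guard_stern
            and stem.endswith("st") and len(stem) - 2 >= r1):
        return word
    if suffix == "s" and (not stem or stem[-1] not in S_ENDING):
        return word
    return stem


def step2(word, r1):
    suffix = longest_suffix(word, ["en", "er", "est", "st"])
    if suffix is None or len(word) - len(suffix) < r1:
        return word
    stem = word[: -len(suffix)]
    # -st needs a valid preceding letter with at least 3 letters before it.
    if suffix == "st" and (len(stem) < 4 or stem[-1] not in ST_ENDING):
        return word
    return stem


def stem(word, guard_stern=False):
    r1 = r1_start(word)  # regions are fixed once, on the original word
    return step2(step1(word, r1, guard_stern), r1)


for w in ("morgenstern", "morgensterne", "geschwister", "geschwistern"):
    print(w, stem(w), stem(w, guard_stern=True))
```

In this sketch the guard merges "morgenstern" with "morgensterne" but separates "geschwistern" from "geschwister", matching the merge and one of the unhelpful splits reported above.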

There's only one motivating example for this change, and no feedback when I asked for it some time ago now.

We could start an exception list of stems to not remove stern from, but I'm not convinced they're common enough to really justify it. My current conclusion is not to try to address this one, but I'm open to discussion, especially if there are more examples which fall into a pattern.

ojwb commented 1 month ago

The change to remove -ers seems the wrong approach to me - the problem here is really that we're overstemming Förderer and Förderern - it would be better to stem those to forder rather than the current ford (which collides with a car brand), though I'm not sure how practical that is to resolve satisfactorily. Fighting overstemming with more overstemming seems problematic though.

@OlgaGuselnikova Do you have more examples of cases that the -ers change helps?