snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

German stemmer possible improvements #161

Open OlgaGuselnikova opened 2 years ago

OlgaGuselnikova commented 2 years ago

Hello, Snowball developers team!

I work on translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

  1. Remove ending -ers

Example (word - stem by Snowball demo - stem by customized algorithm):

    Förderer - ford - ford
    Förderers - forder - ford
    Förderern - ford - ford

  2. Feminine nouns

-erinnen is replaced with -erin

There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered German word should be translated into a different language. Our addition to the algorithm simply stems plural and singular feminine nouns in the same manner.

Example (word - stem by Snowball demo - stem by customized algorithm):

    Politikerin - politikerin - politikerin
    Politikerinnen - politikerinn - politikerin
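
The effect of the -erinnen rule can be approximated outside Snowball. The following is a minimal Python sketch, not the Snowball implementation: `r1_start` and `step1` are hypothetical names, only the first suffix step is modelled, and the 's' and 'nis' cases are left out. It assumes lowercase input and the usual German R1 definition (after the first non-vowel following a vowel, with at least 3 letters before it).

```python
VOWELS = set("aeiouyäöü")

def r1_start(word: str) -> int:
    """Start of R1: just past the first non-vowel that follows a vowel,
    pushed right so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)

def step1(word: str, erinnen_rule: bool) -> str:
    """Toy first suffix step: the longest matching suffix wins and must
    start inside R1. The 's' ending and the 'nis' repair are omitted."""
    suffixes = ["ern", "em", "er", "en", "es", "e"]
    if erinnen_rule:
        suffixes.append("erinnen")
    match = max((s for s in suffixes if word.endswith(s)), key=len, default=None)
    if match is None or len(word) - len(match) < r1_start(word):
        return word
    stem = word[: len(word) - len(match)]
    # The proposed rule rewrites the suffix instead of deleting it.
    return stem + "erin" if match == "erinnen" else stem

for w in ("politikerin", "politikerinnen"):
    print(w, step1(w, erinnen_rule=False), step1(w, erinnen_rule=True))
```

With the rule enabled both forms come out as "politikerin"; without it the plural keeps the doubled "nn", as in the example above.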

  3. Remove -stern

Example (word - stem by Snowball demo - stem by customized algorithm):

    morgenstern - morgen - morgen
    morgensterne - morgenstern - morgen

  4. Remove ending -em

This change does lead to occasional overstemming. However, the word "systems" is common in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

Example (word - stem by Snowball demo - stem by customized algorithm):

    system - syst - syst
    systems - system - syst

  5. -ln replaced with -l

Example (word - stem by Snowball demo - stem by customized algorithm):

    artikel - artikel - artikel
    artikeln - artikeln - artikel

We have implemented these changes (including updating the word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.

Standard suffix algorithm with the changes described above:

```
define standard_suffix as (
    do (
        [substring] R1 among(
            'ers'
            (   delete
            )
        )
    )
    do (
        [substring] R1 among(
            'erinnen'
            (   <- 'erin'
            )
            'em' 'ern' 'er'
            (   delete
            )
            'e' 'en' 'es'
            (   delete
                try (['s'] 'nis' delete)
            )
            's'
            (   s_ending delete
            )
        )
    )
    do (
        [substring] R1 among(
            'stern'
            (   delete
            )
            'en' 'er' 'est' 'em'
            (   delete
            )
            'st'
            (   st_ending hop 3 delete
            )
        )
    )
    do (
        [substring] R2 among(
            'end' 'ung'
            (   delete
                try (['ig'] not 'e' R2 delete)
            )
            'ig' 'ik' 'isch'
            (   not 'e' delete
            )
            'lich' 'heit'
            (   delete
                try (
                    ['er' or 'en'] R1 delete
                )
            )
            'keit'
            (   delete
                try (
                    [substring] R2 among(
                        'lich' 'ig'
                        (   delete
                        )
                    )
                )
            )
        )
    )
    do (
        [substring] R1 among(
            'ln'
            (   <- 'l'
            )
        )
    )
)
```

Thank you for your time!

ojwb commented 1 year ago

Thanks for submitting this and sorry for taking an age to get to it.

Some thoughts:

> Remove ending -ers

I think I need to look into this one more.

> -erinnen is replaced with -erin

This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

> Remove -stern

Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).

> Remove ending -em

If system -> syst is the problematic case, maybe it would be better to prevent that happening instead? It's not conflating with another word like morgenstern, but I think it's good to consider if there's a better way to address this.

I notice this appears to be due to -st before -em and the -stern case to be -st before -ern, but a simple restriction to only remove -em and -ern if not preceded by -st seems to affect cases we probably don't want to change. Are there any other

> -ln replaced with -l

This seems good too.

ojwb commented 11 months ago

> -erinnen is replaced with -erin
>
> This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

#85 was particularly motivated by job listings and wanted to conflate e.g. "Verkäufer" and "Verkäuferin" by stemming them to the same stem. That makes sense there, but in other contexts it may be less helpful, and understemming is the safer option when unsure.

On the "for #85" side, "Verkäuferin" and "Verkäufer" are arguably closer in meaning than "Verkäufer" is to other words we also currently stem to "verkauf" such as "verkaufen". Also the modern trend seems to be away from gendered language (an example from English is that "actor" tends to be used regardless of gender and "actress" gets used less) and assuming similar trends in German (which from #153 I gather is the case) that also tends to argue for the change from #85.

I'm unsure what's best here, but having set down the points above I'm leaning towards the change from #85.

ojwb commented 11 months ago

The -ers removal change only changes the stems for two words in the current german/voc.txt:

So I think this needs more investigation. If we can find more examples ending in -ers and look at which get better and which get worse, maybe we can come up with a rule like this but with a condition (e.g. "remove '-ers' unless preceded by a vowel" would work for the 3 examples we currently have), or put it in a different place in the algorithm.

ojwb commented 10 months ago

Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

```diff
diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index cd303b15..7dfa9c62 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -84,7 +84,10 @@ backwardmode (
     define standard_suffix as (
         do (
             [substring] R1 among(
-                'em' 'ern' 'er'
+                'em'
+                (   not 'syst' delete
+                )
+                'ern' 'er'
                 (   delete
                 )
                 'e' 'en' 'es'
```

@OlgaGuselnikova That addresses all the "system" cases and should avoid the overstemming you mention. Were there any other cases this approach doesn't address?
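
To see what the `not 'syst'` guard does to the "system" words, here is a rough Python sketch of the first suffix step only. It is an approximation under stated assumptions, not the Snowball code: `r1_start`, `step1` and the `guard_syst` flag are hypothetical names, the 'nis' repair is omitted, and input is assumed to be lowercase.

```python
VOWELS = set("aeiouyäöü")
S_ENDING = set("bdfghklmnrt")  # letters after which a final 's' may be removed

def r1_start(word: str) -> int:
    """Start of R1: just past the first non-vowel that follows a vowel,
    pushed right so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)

def step1(word: str, guard_syst: bool) -> str:
    """Toy first suffix step: the longest matching suffix wins and must
    start inside R1. The 'nis' repair is omitted."""
    suffixes = ["ern", "em", "er", "en", "es", "e", "s"]
    match = max((s for s in suffixes if word.endswith(s)), key=len, default=None)
    if match is None:
        return word
    cut = len(word) - len(match)
    if cut < r1_start(word):
        return word
    stem = word[:cut]
    # Counterpart of `not 'syst' delete`: keep '-em' after 'syst'.
    if match == "em" and guard_syst and stem.endswith("syst"):
        return word
    # Counterpart of `s_ending delete`: 's' goes only after a valid letter.
    if match == "s" and (not stem or stem[-1] not in S_ENDING):
        return word
    return stem

for w in ("system", "systems", "betriebssystem", "einem"):
    print(w, step1(w, guard_syst=False), step1(w, guard_syst=True))
```

In this sketch "system" and "systems" both end up as "system" once the guard is on, while an unrelated word such as "einem" still loses its -em.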

ojwb commented 10 months ago

> Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

I've gone ahead and merged this change.

ojwb commented 10 months ago

> -ln replaced with -l
>
> This seems good too.

Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.
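
The Keller/Kellner conflation can be reproduced with a small Python sketch. This is a simplified model rather than the real stemmer: `r1_start` and `stem_sketch` are hypothetical names, only the plain-deletion suffixes of the first step and the proposed final -ln step are modelled, and input is assumed to be lowercase.

```python
VOWELS = set("aeiouyäöü")

def r1_start(word: str) -> int:
    """Start of R1: just past the first non-vowel that follows a vowel,
    pushed right so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)

def stem_sketch(word: str, ln_rule: bool) -> str:
    r1 = r1_start(word)  # regions are fixed before any suffix is removed
    # Toy first step: the longest plain-deletion suffix, if it starts in R1.
    suffixes = ("ern", "em", "er", "en", "es", "e")
    match = max((s for s in suffixes if word.endswith(s)), key=len, default=None)
    if match is not None and len(word) - len(match) >= r1:
        word = word[: len(word) - len(match)]
    # Proposed final step: '-ln' in R1 becomes '-l'.
    if ln_rule and word.endswith("ln") and len(word) - 2 >= r1:
        word = word[:-1]
    return word

for w in ("artikel", "artikeln", "keller", "kellner"):
    print(w, stem_sketch(w, ln_rule=False), stem_sketch(w, ln_rule=True))
```

Here "kellner" first loses -er to give "kelln", and the -ln step then turns that into "kell", the same stem as "keller". The same step is what brings "artikeln" in line with "artikel".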