snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Stemming feminine nouns in german stemmer #85

Closed sepal closed 2 weeks ago

sepal commented 6 years ago

It would be good to stem the feminine forms of nouns in the German stemmer. For example, "Verkäuferin" should be stemmed to "Verkauf", as is already done for the masculine form "Verkäufer". This is especially important for occupations, for example on job boards.

Is there a reason not to do that, or has no one gotten around to doing it or come up with an appropriate algorithm? I might try to open a PR if there is any chance of getting it accepted.

ojwb commented 5 years ago

There's a trade-off between the complexity of the stemming algorithm and how complete a job it does, and for the intended domain of use (information retrieval) you inevitably reach a point of diminishing returns. So there will always be forms which could be conflated but aren't.

A job search does seem to be a motivating case for handling this case better. I think it would be worthwhile if it can be done with minor changes to the existing algorithm and without affecting cases which aren't feminine noun forms (particular care is needed here for words which end -in but aren't feminine nouns).

ojwb commented 5 years ago

I had a look at this, and came up with adding -erin and -erinnen as endings to remove in step 1(a) like so:

diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index 61f24ef9..efae755f 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -78,7 +78,7 @@ backwardmode (
     define standard_suffix as (
         do (
             [substring] R1 among(
-                'em' 'ern' 'er'
+                'em' 'ern' 'er' 'erin' 'erinnen'
                 (   delete
                 )
                 'e' 'en' 'es'

In snowball-data's 35033 word german/voc.txt the patch above alters the stemming of 60 words - I had a quick look (though my German is very rusty) and the only false match I spotted was:

nitroglyzerin -> nitroglyz

That particular case looks harmless (since nothing unconnected should stem to nitroglyz), but it does highlight that some chemical compounds also end -erin (cycloserin seems to be another example).
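To make the effect of the patch concrete, here is a minimal Python sketch of only the part of step 1 that the diff touches. It is not the Snowball implementation: `r1_start` and `strip_group_a` are hypothetical helper names, the prelude, the other suffix groups and all later steps are omitted, and R1 is computed from the usual definition for the German stemmer (the region after the first non-vowel that follows a vowel, moved right if needed so that at least three letters precede it). Input is assumed to be lowercase.

```python
# Illustrative sketch only; the real stemmer is generated from german.sbl.
VOWELS = set("aeiouyäöü")

# Group (a) endings after the patch (assumption: other groups are ignored here).
GROUP_A = ("em", "ern", "er", "erin", "erinnen")


def r1_start(word):
    """Return the index where R1 begins in a lowercase word."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            # R1 starts after this non-vowel, but never before index 3.
            return min(max(i + 1, 3), len(word))
    return len(word)  # no such non-vowel: R1 is empty


def strip_group_a(word, suffixes=GROUP_A):
    """Delete the longest matching ending if it lies entirely inside R1."""
    start = r1_start(word)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= start:
                return word[: -len(suffix)]
            return word  # longest match is outside R1: leave unchanged
    return word


for w in ("verkäufer", "verkäuferin", "verkäuferinnen", "nitroglyzerin"):
    print(w, "->", strip_group_a(w))
```

In this sketch, passing the unpatched list `("em", "ern", "er")` leaves `verkäuferin` unchanged, while the patched list reduces all three forms to `verkäuf` and turns `nitroglyzerin` into `nitroglyz`.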

There are also female forms which don't match this pattern - e.g. Ärztin, Autorin, Matrosin, Närrin, Pastorin, Redaktorin, Sekretärin. Always removing -in is clearly too aggressive (for example, a lot of German words end -ein), but perhaps with some check on what comes before it could be feasible. We don't have to handle every case for a change to be an improvement of course.

sepal commented 5 years ago

Yeah, this issue is quite complicated to tackle. We ended up writing a custom Solr config that removes any -in and -innen. For a general stemmer this is indeed too aggressive, since it would for example stem Benzin to Benz. Since we run job boards, it's less of a problem for our use case.

But thanks for looking into it!
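For illustration, the unconditional rule described above (drop any trailing -in or -innen) can be sketched in a few lines of Python. This is not the Solr configuration mentioned in the comment, which is not shown in this thread; `strip_feminine_aggressive` is a hypothetical name and the input is assumed to be lowercase.

```python
def strip_feminine_aggressive(word):
    """Remove a trailing -innen or -in with no further checks."""
    for suffix in ("innen", "in"):  # try the longer ending first
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


for w in ("ärztin", "autorinnen", "benzin", "klein"):
    print(w, "->", strip_feminine_aggressive(w))
```

The first two words show the intended conflation; `benzin` and `klein` show the kind of over-stemming that makes the rule unsuitable for a general-purpose stemmer.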

ojwb commented 2 weeks ago

Looking into this some more, as well as the "Nitroglyzerin -> nitroglyz" and "Cycloserin -> cyclos" cases I found more words ending "erin" which don't seem to be the female version of something:

The "Schwerin" and "schwer" conflation seems the most ... serious.

These cases all seem to be people, places and chemicals, but none of them seem likely to be that problematic in practice. I didn't list all the people and places above; I suspect there are more chemicals too.

So overall this seems a plausible change. We could instead replace erinnen with erin which would conflate singular and plural of the feminine versions while not attempting to conflate with the masculine versions, but that seems less useful in general.
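The alternative mentioned above, rewriting -erinnen to -erin rather than deleting both, might look roughly like the following Python sketch. It is only an illustration: `fold_feminine_plural` is a hypothetical name, the R1 condition that the real stemmer would apply is left out, and the input is assumed to be lowercase.

```python
def fold_feminine_plural(word):
    """Rewrite a trailing -erinnen to -erin; leave everything else alone."""
    suffix = "erinnen"
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len("nen")]  # drop only the final "nen"
    return word


for w in ("verkäuferinnen", "verkäuferin", "verkäufer"):
    print(w, "->", fold_feminine_plural(w))
```

Under this rule the feminine singular and plural end up as the same string, while the masculine form is not mapped to it, which is why it conflates less than removing both endings.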