Open OlgaGuselnikova opened 2 years ago
Thanks for submitting this and sorry for taking an age to get to it.
Some thoughts:
Remove ending -ers
I think I need to look into this one more.
-erinnen is replaced with -erin
This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).
Remove -stern
Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).
Remove ending -em
If system -> syst is the problematic case, maybe it would be better to prevent that happening instead? It's not conflating with another word like morgenstern, but I think it's good to consider if there's a better way to address this.
I notice this appears to be due to -st before -em and the -stern case to be -st before -ern, but a simple restriction to only remove -em and -ern if not preceded by -st seems to affect cases we probably don't want to change. Are there any other
-ln replaced with -l
This seems good too.
-erinnen is replaced with -erin
This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).
On the "for #85" side, "Verkäuferin" and "Verkäufer" are arguably closer in meaning than "Verkäufer" is to other words we currently also stem to "verkauf", such as "verkaufen". Also, the modern trend seems to be away from gendered language (an example from English is that "actor" tends to be used regardless of gender and "actress" gets used less). Assuming similar trends in German (which from #153 I gather is the case), that also argues for the change from #85.
I'm unsure what's best here, but having set down the points above I'm leaning towards the change from #85.
The `-ers` removal change only changes the stems for two words in the current `german/voc.txt`:
So I think this needs more investigation - if we can find more examples ending `-ers` and look at which are better and which worse, maybe we can come up with a rule like this but with a condition (e.g. "remove `-ers` unless preceded by a vowel" would work for the 3 examples we currently have) or in a different place in the algorithm.
Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):
```diff
diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index cd303b15..7dfa9c62 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -84,7 +84,10 @@ backwardmode (
     define standard_suffix as (
         do (
             [substring] R1 among(
-                'em' 'ern' 'er'
+                'em'
+                (   not 'syst' delete
+                )
+                'ern' 'er'
                 (   delete
                 )
                 'e' 'en' 'es'
```
@OlgaGuselnikova That addresses all the "system" cases and should avoid the overstemming you mention. Were there any other cases this approach doesn't address?
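To make the effect of the `not 'syst'` guard concrete, here is a minimal Python sketch of just that one rule. It is not the Snowball implementation: `r1_start` and `strip_em` are hypothetical helper names, the input is assumed to be lowercase already, and the rest of the German stemmer (prelude, other suffixes, later steps) is left out.

```python
VOWELS = set("aeiouyäöü")


def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel that follows
    a vowel, moved right if needed so at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def strip_em(word: str) -> str:
    """Delete a final -em that lies in R1, unless it is preceded by 'syst'."""
    if word.endswith("em") and len(word) - 2 >= r1_start(word):
        if not word[:-2].endswith("syst"):
            return word[:-2]
    return word


if __name__ == "__main__":
    for w in ["system", "betriebssystem", "grossem", "diesem"]:
        print(w, "->", strip_em(w))
```

With this guard, words ending in "system" keep their `-em`, while other `-em` endings in R1 are still removed.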
I've gone ahead and merged this change.
-ln replaced with -l
This seems good too.
Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.
If we handle this removal in step 1 then we avoid conflating `Keller` and `Kellner` without making anything else worse (at least in the sample vocabulary list `snowball-data/german.voc.txt`). This is the change I tested (removing `-lns` too means we still conflate `rasseln` and `rasselns` but doesn't affect anything else in the sample vocabulary):
```diff
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -98,6 +98,9 @@ backwardmode (
                 's'
                 (   s_ending delete
                 )
+                'ln' 'lns'
+                (   <- 'l'
+                )
             )
         )
         do (
```
I'm running a script to collate a larger (and perhaps more modern) German wordlist from a de.wikipedia.org dump - then we can see how this looks for a more comprehensive vocabulary list (and see if `-lns` is worthwhile - if it really affects just a single word that is probably not worthwhile).
Testing on a larger list, there are more cases where, if we do `-ln` -> `-l`, then also doing `-lns` -> `-l` is useful, mostly nouns which happen to end `-ln`. That means we now generate a stem which isn't linguistically correct for these, but that's not a problem in the intended domain of use; the only concern is whether that introduces unwanted conflation, and in practice it seems it doesn't. Pushing changes to implement that.
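As a rough illustration of why the placement matters, the following Python sketch contrasts the two orderings discussed above. It is a toy model under stated assumptions: only the `-er` removal and the `-ln`/`-lns` replacement are modelled, all function names are hypothetical, and R1 is recomputed on the current string rather than marked once as Snowball does (which gives the same result for these words).

```python
VOWELS = set("aeiouyäöü")


def r1_start(word: str) -> int:
    """Start of R1, adjusted so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def replace_ln(word: str) -> str:
    """Replace a final -lns or -ln in R1 with -l."""
    for suffix in ("lns", "ln"):
        if word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word):
            return word[: -len(suffix)] + "l"
    return word


def remove_er(word: str) -> str:
    """Delete a final -er in R1 (stand-in for the other step 1 suffixes)."""
    if word.endswith("er") and len(word) - 2 >= r1_start(word):
        return word[:-2]
    return word


def ln_in_step1(word: str) -> str:
    """Only one suffix action fires in step 1: -ln/-lns or -er."""
    replaced = replace_ln(word)
    return replaced if replaced != word else remove_er(word)


def ln_as_final_step(word: str) -> str:
    """-er is removed first, then -ln is rewritten on whatever is left."""
    return replace_ln(remove_er(word))


if __name__ == "__main__":
    for w in ["keller", "kellner", "artikeln", "rasseln", "rasselns"]:
        print(w, ln_in_step1(w), ln_as_final_step(w))
```

In this toy model, running the replacement as a final step turns "kellner" into "kelln" and then "kell", which collides with "keller"; handling it inside step 1 leaves "kelln" alone.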
Remove -stern
Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).
I had another look at this, and tried
```diff
diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index c0973c72..6de281f8 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -88,7 +88,11 @@ backwardmode (
                 (   not 'syst' // don't remove -em from words ending -system
                     delete
                 )
-                'ern' 'er'
+                'ern'
+                (   not ('st' R1) // don't remove -stern from morgenstern, etc
+                    delete
+                )
+                'er'
                 'erin' 'erinnen' // conflate female versions of nouns
                 (   delete
                 )
```
This gives:
A total of 21 words changed stem
* 10 words changed stem but aren't interesting
1 merges of groups of stems:
{ morgenstern } + { morgensterne }
* 9 splits of groups of stems:
{ abend abende abendlichen abends | abendstern }
{ eisgeschwister | eisgeschwistern }
{ finster finstere finsteren finsteres | finstern }
{ geschwister | geschwistern }
{ höllengeister | höllengeistern }
{ leit leite leiten leiter leiterin leiters leitest | leitstern }
{ naturgeister | naturgeistern }
{ philister | philistern }
{ vorg | vorgestern }
Of those, `morgenstern`, `abendstern` and `leitstern` are all something-star cases which seem minor improvements; `vorg` doesn't seem to be a word; the other splits seem unhelpful. Also `fenstern` is now conflated with `fensternische` instead of with `fenster`+`fensters`, which seems slightly worse but reasonable.
Overall this doesn't seem a worthwhile change. (I tried the extra `R1` check to try to reduce unwanted changes but it's not really helping - without it there were about twice as many changes, but not worthwhile overall either.)
There's only one motivating example for this change, and no feedback when I asked for it some time ago now.
We could start an exception list of stems to not remove `stern` from, but I'm not convinced they're common enough to really justify it. My current conclusion is not to try to address this one, but I'm open to discussion, especially if there are more examples which fall into a pattern.
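For anyone wanting to repeat this kind of before/after comparison on their own word list, here is a minimal Python sketch of the idea: group words by stem under the old and the new stemmer, then report the groups that differ. This is not the script used for the figures above; `stem_groups` and `changed_groups` are hypothetical names, and the two toy stemmers are stand-ins whose stems are illustrative rather than actual Snowball output.

```python
from collections import defaultdict
from typing import Callable, Iterable


def stem_groups(words: Iterable[str], stem: Callable[[str], str]) -> set:
    """Partition words into groups that share a stem."""
    by_stem = defaultdict(set)
    for word in words:
        by_stem[stem(word)].add(word)
    return {frozenset(group) for group in by_stem.values()}


def changed_groups(words, old_stem, new_stem):
    """Return (groups only in the old partition, groups only in the new one)."""
    words = list(words)
    old = stem_groups(words, old_stem)
    new = stem_groups(words, new_stem)
    gone = sorted(sorted(g) for g in old - new)
    added = sorted(sorted(g) for g in new - old)
    return gone, added


# Toy stand-ins for two stemmer versions (illustrative stems only).
OLD = {"geschwister": "geschwist", "geschwistern": "geschwist", "haus": "haus"}
NEW = {"geschwister": "geschwist", "geschwistern": "geschwistern", "haus": "haus"}

if __name__ == "__main__":
    gone, added = changed_groups(OLD, OLD.get, NEW.get)
    print("old groups that disappeared:", gone)
    print("new groups that appeared:", added)
```

One old group being replaced by several new ones corresponds to a split, and several old groups being replaced by one new group corresponds to a merge.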
The change to remove `-ers` seems the wrong approach to me - the problem here is really that we're overstemming `Förderer` and `Förderern` - it would be better to stem those to `forder` rather than the current `ford` (which collides with a car brand), though I'm not sure how practical that is to resolve satisfactorily. Fighting overstemming with more overstemming seems problematic though.
@OlgaGuselnikova Do you have more examples of cases that the `-ers` change helps?
Hello, Snowball developers team!
I work on developing translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.
Remove ending -ers
Example (word - stem by Snowball demo - stem by customized algorithm):
Förderer - ford - ford
Förderers - forder - ford
Förderern - ford - ford
-erinnen is replaced with -erin
There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered word in German should be translated into a different language. Our addition to the algorithm simply provides a way to stem plural feminine nouns and singular feminine nouns in the same manner.
Example (word - stem by Snowball demo - stem by customized algorithm):
Politikerin - politikerin - politikerin
Politikerinnen - politikerinn - politikerin
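The `-erinnen` rule on its own can be sketched in a few lines of Python. This is only an illustration of that single replacement, assuming lowercase input; `r1_start` and `unify_feminine_plural` are hypothetical names and the remaining stemmer steps are not modelled.

```python
VOWELS = set("aeiouyäöü")


def r1_start(word: str) -> int:
    """Start of R1, adjusted so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def unify_feminine_plural(word: str) -> str:
    """Replace a final -erinnen in R1 with -erin; leave other words alone."""
    suffix = "erinnen"
    if word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word):
        return word[: -len(suffix)] + "erin"
    return word


if __name__ == "__main__":
    for w in ["politikerin", "politikerinnen"]:
        print(w, "->", unify_feminine_plural(w))
```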
Remove -stern
Example (word - stem by Snowball demo - stem by customized algorithm):
morgenstern - morgen - morgen
morgensterne - morgenstern - morgen
Remove ending -em
That change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".
Example (word - stem by Snowball demo - stem by customized algorithm):
system - syst - syst
systems - system - syst
-ln replaced with -l
Example (word - stem by Snowball demo - stem by customized algorithm):
artikel - artikel - artikel
artikeln - artikeln - artikel
We have implemented those changes (including updating word lists), so if after discussion you find changes (or some of them) useful, I can create a PR.
Standard suffix algorithm with the changes described above:
```
define standard_suffix as (
    do (
        [substring] R1 among(
            'ers'
            (   delete
            )
        )
    )
    do (
        [substring] R1 among(
            'erinnen'
            (   <- 'erin'
            )
            'em' 'ern' 'er'
            (   delete
            )
            'e' 'en' 'es'
            (   delete
                try (['s'] 'nis' delete)
            )
            's'
            (   s_ending delete
            )
        )
    )
    do (
        [substring] R1 among(
            'stern'
            (   delete
            )
            'en' 'er' 'est' 'em'
            (   delete
            )
            'st'
            (   st_ending hop 3 delete
            )
        )
    )
    do (
        [substring] R2 among(
            'end' 'ung'
            (   delete
                try (['ig'] not 'e' R2 delete)
            )
            'ig' 'ik' 'isch'
            (   not 'e' delete
            )
            'lich' 'heit'
            (   delete
                try (
                    ['er' or 'en'] R1 delete
                )
            )
            'keit'
            (   delete
                try (
                    [substring] R2 among(
                        'lich' 'ig'
                        (   delete
                        )
                    )
                )
            )
        )
    )
    do (
        [substring] R1 among(
            'ln'
            (   <- 'l'
            )
        )
    )
)
```
Thank you for your time!