snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Swedish stemmer #47

Open Benjaminsson opened 7 years ago

Benjaminsson commented 7 years ago

I think I have found a bug in the Swedish stemmer. When searching for "mötet" (the meeting) I should get results for "möte" and "möten". I think the problem is in stemming words ending with "et". (Words ending with "andet" and "het" should work though, since those endings are in the suffix list.)

When searching for the longest suffix in the first step, I added the suffix "et" and that works. I don't know if that is the right way to fix this though.

ojwb commented 5 years ago

Sorry it's taken ages for anyone to respond. I'm not familiar with Swedish, but have been looking into this.

A point I should probably make first is that there are inevitably some trade-offs between under- and over-stemming: not removing -et here is understemming, but removing -et from other words where it isn't the definite article suffix would be overstemming. And when overstemming leads to unrelated words being mapped to the same stem, it's arguably worse than understemming.

http://snowballstem.org/algorithms/scandinavian.html notes:

the definite article (the in English, der etc in German) there corresponds a noun ending in the Scandinavian languages. This ending cannot always be removed with certainty. In Swedish, for example, the en form is removed, but not the t or n form,

That doesn't explicitly mention "et", but it says "for example" so the list probably isn't exhaustive.

So while adding suffix "et" helps this particular case, the key question is really whether it does more harm than good overall, which I haven't really reached a conclusion on.

Are there Swedish words which happen to end "et" where the "et" isn't the definite article?

znakeeye commented 2 years ago

Another three years later 😛

I'm actively investigating the correctness of the Swedish stemmer, so maybe I can be of help. Yes, there are a bunch of words that end with et. E.g.:

"paket" (package) - definite form: "paketet" (plural "paketen")
"piket" (picket) - definite form: "piketen" (plural "piketer")
"raket" (rocket) - definite form: "raketen" (plural "raketerna")
"staket" (fence) - definite form: "staketet" (plural "staketen")
"trumpet" (trumpet) - definite form: "trumpeten" (plural "trumpeterna")

I don't think you can easily define a rule here. E.g. we have:

"taket" - definite form of "tak" (roof)
"riket" - definite form of "rike" (kingdom)

So indeed, we have to choose between under- and over-stemming.

ojwb commented 2 years ago

Thanks for the useful input. If the choice is under- or over-stemming, our bias is generally towards under as that is usually less problematic.

Two possible options for removing this ending in at least some cases come to mind:

Otherwise it sounds to me like we maybe can't do better here, and perhaps should just add et to the list of examples in the text I quoted above (and probably also put that note into the page specifically about the Swedish stemmer).

znakeeye commented 2 years ago

Thanks for good pointers!

Words ending in etet and eten are extremely rare, and in almost all of them the two endings are interchangeable, e.g. medveten vs medvetet. The stem here should be medvet, right? Similar to medvetande.

The only outliers I can think of are societet (like you mentioned) and varietet. For these words, I would actually argue that the stem is in fact societ and variet, since the et is often replaced by é.

This seems like a significant improvement. There are many words ending with et (e.g. itet is very common).

ojwb commented 2 years ago

The stem here should be medvet, right? Similar to medvetande.

Well, a relevant point here is that we aren't aiming to implement lemmatisation - what actually matters isn't that we accurately map words to their root form, but rather that we conflate words with a common meaning onto the same string, and words with different meanings onto different strings. That string (the stem) often looks like a word, and may actually be the root form quite often, but that's not a requirement.

Currently medveten and medvetande both stem to medvet, and wiktionary.org's definitions indicate that's good, so medvetet stemming to medvet too sounds like an improvement.

That also means it's OK for societet and varietet to lose the last et provided that doesn't collide with unrelated words and other forms of these words end up with the same stem (at least to the extent they're currently conflated).

I've opened PRs for both snowball and snowball-data (both auto-linked above) with a draft change which special cases -etet and -eten. While admittedly I don't know any useful amount of Swedish, the changes in expected output in the snowball-data PR look very sensible to me.

Assuming we go with this, the website description needs updating too - I'm happy to do that, but will wait to see if there's some problem with this change, or if there are further changes in a similar vein we could make.

I handled eten as well as etet, though that only affects a single word in our swedish/voc.txt - medvetenhet which now stems to medvet instead of medveten.

znakeeye commented 2 years ago

Thanks for clarification. Conflating words with a common meaning, got it!

I'm convinced there is a general rule here. To begin with, we can extend the et prefix to e[gmtv]. This implies correct stemming of some additional words. E.g.:

In fact, we can improve this even more. All words ending with uten have an utet form. The same goes for any other vowel. Also, other consonants than t will follow the same pattern.

THE GENERAL RULE

Using exhaustive search and my intuitive understanding of the Swedish language, this is what I have come up with:

For any word ending with vowel consonant e[nt] we can remove the e[nt] suffix.

If you, using your knowledge, can challenge this statement that would be great. E.g. "Find a word that fulfills condition x and y!" Exactly what should we look for? I can be of help.

Given that this rule has no conflicts, the stemmer will be able to correctly conflate thousands of new words.
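
The rule as stated can be sketched in a few lines of Python (not Snowball). This is only an illustration on bare words: it ignores R1 and the question of where in the real stemmer the rule would run. The function name is hypothetical, the vowel set is assumed to be the one the Snowball Swedish stemmer uses (a e i o u y ä å ö), and "consonant" is taken to mean any other letter.

```python
# Vowels as used by the Snowball Swedish stemmer (assumed; NFC-normalised input).
VOWELS = set("aeiouyäåö")

def strip_vc_en_et(word: str) -> str:
    """Remove a final 'en' or 'et' when it is preceded by vowel + consonant.

    Hypothetical helper illustrating the proposed rule only; it is not the
    code from the pull request.
    """
    if len(word) >= 4 and word[-2:] in ("en", "et"):
        vowel, consonant = word[-4], word[-3]
        if vowel in VOWELS and consonant not in VOWELS:
            return word[:-2]
    return word

for w in ["medveten", "medvetet", "mötet", "taket", "abfallen"]:
    print(w, "->", strip_vc_en_et(w))
```

With this sketch medveten and medvetet both become medvet, while a word such as abfallen is left alone because the letter two places before the ending is a consonant.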

ojwb commented 2 years ago

Here's the list of words in swedish/voc.txt that stem differently with this rule in the same place #154 changes (this is compared to the version currently in #154 with 'et' instead of vowel consonant):

vowel-consonant-et-or-en.delta.txt

I skimmed through the output changes starting a-f so far, and the only one which seems problematic at all there is:

för -> för
föra -> för
föras -> för
före -> för
fören -> för
förena -> fören                                      för
förenad -> fören                                     för
förenade -> fören                                    för
förenande -> fören                                   för
förenar -> fören                                     för
förenas -> fören                                     för
förenat -> fören                                     för
förer -> för
förlig -> för

This is <voc entry> -> <output entry> for the baseline with the new output entry to the right where different (and the baseline is the code as in current #154).

This is falsely conflating forms of för (meaning the "bow of a ship" apparently) and forms of förenad (meaning "united", apparently), though some forms were already being conflated so it also means we're now conflating some forms which should be. Overall this doesn't really seem better or worse for these cases.

I'll look over the rest of the differences.

ojwb commented 2 years ago

Actually I think there isn't any existing conflation here (I was confusing fören as an input and fören as a stem), and so this is a case that's made worse.

znakeeye commented 2 years ago

Ok. för means something like fore in this case. Like foresee.

före+n is a very rare construct. It took me a while to realize it wasn't för+en. Which one got worse? Trying to follow the snowball speak here :)

ojwb commented 2 years ago

Perhaps this makes it clearer:

https://snowballstem.org/demo.html?text=f%C3%B6r%0af%C3%B6ren%0af%C3%B6rar%0af%C3%B6rarna%0af%C3%B6rs%0af%C3%B6rens%0af%C3%B6rars%0af%C3%B6rarnas%0a%0af%C3%B6rena%0af%C3%B6renas%0af%C3%B6renat%0af%C3%B6renats%0af%C3%B6rena%0af%C3%B6renen%0af%C3%B6renar%0af%C3%B6renade%0af%C3%B6renas%0af%C3%B6renades%0af%C3%B6rena%0af%C3%B6renade%0af%C3%B6renas%0af%C3%B6renades%0af%C3%B6rene%0af%C3%B6renade%0af%C3%B6renes%0af%C3%B6renades%0af%C3%B6renande%0af%C3%B6renad#Swedish

The first group are (at least according to wiktionary) declensions of the noun för, and currently all but one is stemmed to för.

The second group are (again according to wiktionary) conjugations of the verb förena, and currently all but one is stemmed to fören.

With the vowel consonant e[nt] change, everything that currently stems to fören in the second group would instead stem to för.

I think the issue here is essentially that removing the extra en in this case collides with a different word.
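
One way to look for this kind of collision mechanically is to group a word list by stem under the old and the new rules and report every new stem whose group spans more than one old stem. The sketch below is Python with hypothetical function names; the two "stemmers" are just lookup tables filled in from the för/förena rows of the delta shown earlier in this thread, not calls into the real stemmer.

```python
from collections import defaultdict

def stem_groups(words, stem):
    """Map each stem to the set of words that produce it."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return groups

def new_conflations(words, old_stem, new_stem):
    """Return {new stem: sorted words} for groups that merge distinct old stems."""
    merged = {}
    for s, members in stem_groups(words, new_stem).items():
        if len({old_stem(w) for w in members}) > 1:
            merged[s] = sorted(members)
    return merged

# Toy tables taken from the delta output quoted in this thread.
baseline = {"för": "för", "föra": "för", "fören": "för",
            "förena": "fören", "förenad": "fören", "förenar": "fören"}
proposed = {w: "för" for w in baseline}

print(new_conflations(baseline, baseline.get, proposed.get))
```

Running the same comparison over the whole of swedish/voc.txt with the real old and new stemmers would list every stem that newly merges groups, which still needs a human to judge whether each merge is wanted.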

ojwb commented 2 years ago

Another one: forms of lägenhet (apartment) currently stem to lägen but would stem to läg, which forms of läge (situation) stem to.

ojwb commented 2 years ago

Forms of planet (meaning "planet") such as planeter used to stem to planet but now stem to plan. Now planet itself is ambiguous as it's also the definite singular form of plan (so could mean "the plan") so for planet itself this may be reasonable, but it's less helpful for planeter, etc as they clearly mean "planet" not "plan".

BTW, these cases probably don't sink the idea as it does seem very promising (I think it's probably hundreds rather than thousands of cases where it makes a positive difference, but that's still a lot).

I'm trying to gather cases where it might be problematic to see if the rule could be adjusted to avoid them, or else we could add a short list of exceptions where it shouldn't be applied.

Also, what's the intuition that led you to think that would be a good check? It would be useful to document why this rule was chosen.

znakeeye commented 2 years ago

Most verbs end with a and many of these can be transformed into adjectives by replacing the a with either en or et.

I didn't realize the stemmer "loops" like you described. If "förena" becomes "fören" and then - in a second loop? - "för", I would say the "general rule" should then only apply in the first run.

Will try to come up with a better rule.

znakeeye commented 2 years ago

So what we are looking for are pairs of nouns that end with vowel consonant, with or without a trailing e. The definite forms of these pairs could then "collide". Such pairs are not super common (I found some 50-70), but they are still a significant problem.

Will try to improve the rule to handle these cases, but from a quick glance it looks like it will be impossible without a list of exceptions. Is it reasonable to have such a long list of exceptions?

znakeeye commented 2 years ago

On the other hand, when words like före+n and för+en are incorrectly grouped together (with or without stemming), that is precisely what a human would do too. Without context and semantic analysis it's impossible to know which word we are referring to.

In my opinion it is up to the API consumer to distinguish these words/stems using grammatical analysis.

But as previously mentioned, the suggested rule should only apply when "removing the first stem" - the first run (or whatever you call it 😛). That way lägenhet would be correctly stemmed as lägen.

znakeeye commented 2 years ago

Let's try this then:

For any word ending with vowel consonant e[nt] we can remove the e[nt] suffix if the suffix is NOT preceded by any of the following:

bal bet byk båg för gag gal gam gar gås hak huv kas kat kav ked kop kyl käk kål kår köt lav lev
lik lim lov löp mod nap nar nog nyt pad pek pil rat rav red rep res riv rot ryn räd råg råk sel
säd såt teg tor tät val vas vep vis

Those 57 prefixes should handle all cases. I removed some prefix candidates, since I determined that their matching words have similar meaning, etc. It should be noted that most of the matching words have a length of 4 or 5 letters - e.g. "löpe". Perhaps they would be automatically handled by that R1 stuff? That would allow us to reduce the list of exceptions even more. Thoughts?
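
A rough Python sketch of the rule with this exception list follows. It makes two assumptions that the text above does not settle: "preceded by" is read as "the part of the word before the suffix ends with one of the listed strings", and the rule is applied to a bare word with no R1 check. The function name is hypothetical.

```python
# Vowels as used by the Snowball Swedish stemmer (assumed).
VOWELS = set("aeiouyäåö")

# The exception strings exactly as listed above.
EXCEPTIONS = tuple("""
bal bet byk båg för gag gal gam gar gås hak huv kas kat kav ked kop kyl käk kål kår köt lav lev
lik lim lov löp mod nap nar nog nyt pad pek pil rat rav red rep res riv rot ryn räd råg råk sel
säd såt teg tor tät val vas vep vis
""".split())

def strip_en_et_with_exceptions(word: str) -> str:
    """Remove final 'en'/'et' after vowel + consonant unless an exception matches."""
    if len(word) < 4 or word[-2:] not in ("en", "et"):
        return word
    rest = word[:-2]
    if rest[-2] not in VOWELS or rest[-1] in VOWELS:
        return word  # not vowel + consonant before the suffix
    if rest.endswith(EXCEPTIONS):
        return word  # assumed reading of "preceded by"
    return rest

for w in ["fören", "löpen", "medvetet", "planet", "raket"]:
    print(w, "->", strip_en_et_with_exceptions(w))
```

Under these assumptions fören and löpen are left alone and medvetet becomes medvet, while planet and raket are still stripped because plan and rak are not in the list.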

Documentation

Definite forms of nouns are often constructed with -en or -et endings. In most cases, the ending can safely be removed. However, nouns ending in e need special handling where dropping the e would produce another noun. Using a list of known prefixes we can avoid conflating such words.

Removed prefixes

Words with similar meaning. E.g. "hag" and "hage" are equal, "bus" (mischief or crime) and "buse" (ruffian) are related: bus dam eal fån hag lak

Reasonable trade-off. E.g. "talar+en" is one crazy-rare word. However, "talare+n" is very common: lar

ojwb commented 2 years ago

Two more:

raket (rocket) and rak (straight)

staket (fence) would get conflated with staken (candlestick); staket and staketet (the fence) would no longer be conflated.

ojwb commented 2 years ago

There's no looping, just 3 steps applied in turn - see https://snowballstem.org/algorithms/swedish/stemmer.html for an English description of the current algorithm.

I think the issue with lägenhet is largely due to where I implemented the etet -> et and eten -> et rules, which I chose as that gave the desired results - I adjusted this to test the vowel consonant e[nt] rule and left it at the same place, since this widened rule was still handling the original two cases. Perhaps that could be done elsewhere, but it's complicated by this being a deliberate attempt to remove et even when it is really part of the root of the word (i.e. we're deliberately making our stem be the actual root form but without et at the end).

I wonder if you're aiming too close to perfection here, when any solution will inevitably be imperfect because human languages aren't cleanly designed (for example, as you noted it is impossible to tell if fören is för+en or före+n without context the stemmer doesn't have).

Fundamentally, these algorithms are intended to be used in text search applications to improve results - in general they'll improve recall at the expense of some loss of precision (as defined by https://en.wikipedia.org/wiki/Precision_and_recall#Definition_(information_retrieval_context)), and that trade-off is pretty much inherent since the various forms being conflated will tend to carry at least slight differences in meaning.

With that in mind, understemming can be viewed as simply not giving up some precision to improve recall in certain cases. Understemming doesn't make a stemmer useless, since a stemmer with a lighter touch will still give improved recall over not using a stemmer (and with better precision than a mythical perfect stemmer!)

Overstemming is more problematic because it worsens precision without improving recall. It also leads to documents matching a search without any clear reason, which is confusing for users.
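
To make the trade-off concrete, here is a small Python sketch that computes precision and recall for three made-up result sets. The document IDs and the three scenarios are invented for illustration and do not come from any real index or stemmer; both sets passed to the function are assumed to be non-empty.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = hits / retrieved, recall = hits / relevant (non-empty sets)."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Documents 1-3 are about the thing searched for; 4 and 5 are unrelated.
relevant = {1, 2, 3}

scenarios = {
    "no stemming":  {1},           # only the exact word form matches
    "understemmed": {1, 2},        # some forms conflated, one still missed
    "overstemmed":  {1, 2, 4, 5},  # unrelated words share the stem
}

for name, retrieved in scenarios.items():
    p, r = precision_recall(retrieved, relevant)
    print(f"{name:13s} precision={p:.2f} recall={r:.2f}")
```

In this toy setup the understemming case keeps precision at 1.00 while raising recall from 0.33 to 0.67, whereas the overstemming case has the same recall but precision drops to 0.50.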

Based on this, I'd advocate for finding simple rules that handle common cases and overstem rarely. Rules that make sense from the grammar are more satisfactory than ad-hoc patterns that just seem to work.

A rule with 50+ exceptions seems like too many to me (and your exception list doesn't appear to cover some of the cases I noted above either). I don't think R1 helps cull any, e.g. for löpe R1 starts after the p so löpen would be considered. If we're really going to take this approach, I think we need to not make this a list of apparently arbitrary exceptions, and annotate each exception with a comment as to why it's there, since otherwise we're failing to record information that's useful for maintaining the code. Without knowing why an exception was included we can't usefully evaluate a future suggestion to remove it based on it firing for a word it seems should be stemmed.
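
For readers unfamiliar with R1, here is a minimal Python sketch of how it is located for Swedish, assuming the definition from the Snowball documentation: R1 is the region after the first non-vowel following a vowel, adjusted so that the region before it contains at least 3 letters, and it is empty if there is no such non-vowel. The function names are hypothetical.

```python
# Vowels as used by the Snowball Swedish stemmer (assumed).
VOWELS = set("aeiouyäåö")

def r1_start(word: str) -> int:
    """Index where R1 begins; len(word) means R1 is empty."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            # R1 starts after this non-vowel, but never before index 3.
            return min(max(i + 1, 3), len(word))
    return len(word)

def r1(word: str) -> str:
    """Return the R1 region of the word."""
    return word[r1_start(word):]

for w in ["löpe", "löpen", "för"]:
    print(w, "-> R1 =", repr(r1(w)))
```

For löpen this gives an R1 of "en", so a suffix check restricted to R1 would still see the ending, which is why R1 alone does not remove the need for the löp exception.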

znakeeye commented 2 years ago

Got it.

Now...

  • Similarly that's not really an option if there are words ending etet which aren't definitive forms, again unless they're really rare. Our 30623 entry swedish/voc.txt has 11 such words:

Unfortunately, there are several hundred such words. Just to mention a few:

alfabetet
anletet
betet
dekretet
envetet
epitetet
gletet
gnetet
headsetet
kletet
konkretet
metet
paketet
petet
rågvetet
sekretet
setet
sketet
smetet
stretet
trumsetet
varietet
vetet

Is this a problem?

ojwb commented 10 months ago

Sorry I've not managed to get back to this before - there's a lot of info to get back in my head and I've not found a suitable time to. I've just been looking into this again and I think a key thing is exactly where to put this removal. My PR patch picked a somewhat arbitrary point that seemed to work, but it seems that may not be ideal for an expanded version.

But as previously mentioned, the suggested rule should only apply when "removing the first stem" - the first run (or whatever you call it 😛). That way lägenhet would be correctly stemmed as lägen.

That would seem to mean slotting this into the main_suffix routine. I've just looked at doing that, but one problem is that it already removes -en with the only condition being that it's in R1, so I'm not sure how to reconcile that with adding your new conditional check on removing -en or -et. I tried removing the unconditional -en to do this and that seems to give worse results (e.g. "abfallen" no longer gets stemmed so isn't conflated with "abfall").

Maybe you were suggesting doing this as a separate new step before main_suffix? I just tried that and it seems problematic - e.g. currently these all stem to abgeleitet:

abgeleitet  
abgeleitete 
abgeleiteten
abgeleitetes

With your "vowel consonant e[nt]" with 57 exceptions rule done as a first step, abgeleitet stems to abgeleit while the rest are unchanged (if done as a separate step after main_suffix then these all stem to abgeleit which seems OK).

ojwb commented 10 months ago

Hang on, I just realised I was unintentionally testing with the German voc.txt not the Swedish one (I reused a command line from history and failed to notice the exact path). So I need to retest but it'll probably need to be tomorrow.