snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Is it normal that comparatives and superlatives are not stemmed? #172

Open raffaem opened 2 years ago

raffaem commented 2 years ago
>>> import Stemmer
>>> stemmer = Stemmer.Stemmer('english')
>>> print(stemmer.stemWord('poorer'))
poorer
>>> print(stemmer.stemWord('cleaner'))
cleaner
>>> print(stemmer.stemWord('cleanest'))
cleanest
ojwb commented 1 year ago

I've moved this ticket because the question here is really about the code in the snowball repo (pystemmer is just a thin wrapper layer on top of this).

This point doesn't seem to be explicitly covered in the algorithm documentation on the website, but I think these aren't done because the obvious rules for them would also trigger in cases where they'd be harmful. In the intended domain of use (generating index terms for information retrieval) overstemming (at least when it causes collisions between unrelated words) is much more problematic than understemming, so we tend to err on the side of understemming in such cases.

For example, tempest and temper would both be reduced to temp (and would also collide with temp meaning a temporary employee), wither would collide with with, etc.
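To make the collision concrete, here is a minimal sketch of what an unconditional rule would do. The function `naive_strip` is hypothetical and is not part of Snowball or pystemmer; it just removes a trailing -est or -er with no further conditions.

```python
def naive_strip(word: str) -> str:
    """Hypothetical rule: drop a trailing -est or -er unconditionally."""
    for suffix in ("est", "er"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("poorer", "cleanest", "tempest", "temper", "wither"):
        # The first two are the wanted conflations; the last three are
        # the harmful collisions described above.
        print(word, "->", naive_strip(word))
```

With this rule poorer and cleanest reduce as hoped, but tempest and temper both become temp and wither becomes with, which is the overstemming the real algorithm avoids.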

There is actually a rule to remove an -er suffix, but only in R2 (https://snowballstem.org/texts/r1r2.html). This means some longer comparatives are actually handled (e.g. yellower) as well as conflating observer with observe, observed, observes, observing, etc.
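As a rough illustration of why the R2 condition separates yellower from poorer, the sketch below computes R1 and R2 following the definition on the linked page (R1 starts after the first non-vowel that follows a vowel; R2 is the same thing applied inside R1). It is a simplification and not the real stemmer: it treats y as a vowel everywhere, ignores the English stemmer's special cases, and models only the "-er in R2" check rather than the full sequence of steps. All function names are made up for this example.

```python
VOWELS = set("aeiouy")  # assumption: y always counts as a vowel here


def region_start(word: str, start: int) -> int:
    """Index just past the first non-vowel following a vowel, scanning
    from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word: str) -> tuple[int, int]:
    """Start offsets of R1 and R2 within `word`."""
    r1 = region_start(word, 0)
    r2 = region_start(word, r1)
    return r1, r2


def strip_er_in_r2(word: str) -> str:
    """Remove a trailing -er only if the suffix lies entirely in R2."""
    _, r2 = r1_r2(word)
    if word.endswith("er") and len(word) - 2 >= r2:
        return word[:-2]
    return word


if __name__ == "__main__":
    for word in ("yellower", "observer", "poorer", "cleaner", "temper"):
        r1, r2 = r1_r2(word)
        print(f"{word}: R1={word[r1:]!r} R2={word[r2:]!r} -> {strip_er_in_r2(word)}")
```

In this sketch yellower and observer have "er" inside R2 and lose it, while poorer, cleaner and temper have an empty R2 and are left alone.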

If we were to add est where er is handled, there is the odd problematic case - e.g. interest -> inter (colliding with inter meaning to bury) and similarly disinterest -> disinter, but it mostly seems helpful so maybe that's worth considering.

ojwb commented 10 months ago

I had a look for past discussion of this and found Martin posted about -est in https://lists.tartarus.org/pipermail/snowball-discuss/2003-December/000548.html (just under 20 years ago!):

Yes -est is not removed, there being too many words from which its removal would be incorrect - behest, attest, request and so on. Removing -est from longer words only is not really satisfactory, since the English comparative and superlative endings are only added to short adjectives anyway: "curioser and curioser" is not, of course, correct English.

Similar comments also from Martin slightly more recently in https://lists.tartarus.org/pipermail/snowball-discuss/2009-November/001137.html :

Comparatives and superlatives in English have too many exceptions for them to be usefully put into a general rule for suffix removal. Think of,

winter center after aether elder ...

divest detest digest attest ...

The usual way to 'soften' a rule like this in the Porter stemmer is to make it applicable to longer words only -- typically, those that have at least a two syllable stem. But the problem there is that in English the comparative and superlative endings are only added to short adjectives anyway. So we have bigger, larger, fatter, but not giganticer, immenser, enormouser.

The problem can only be solved by building up special word lists of adjectives that can take these endings.
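A word-list approach along those lines might look roughly like the following. This is only a sketch: the list `GRADABLE` is a tiny made-up sample, the function name is hypothetical, and nothing like this exists in Snowball as far as this thread indicates. The ending is removed only when what remains (allowing for a dropped final e or a doubled consonant) is a listed adjective.

```python
# Hypothetical, deliberately tiny list of adjectives that take -er/-est.
GRADABLE = {"big", "large", "fat", "poor", "clean"}


def strip_degree(word: str) -> str:
    """Return the listed adjective if `word` is its comparative or
    superlative; otherwise return `word` unchanged."""
    for suffix in ("est", "er"):
        if not word.endswith(suffix):
            continue
        base = word[: -len(suffix)]
        candidates = [base, base + "e"]            # poor+er, large+r
        if len(base) >= 2 and base[-1] == base[-2]:
            candidates.append(base[:-1])           # big+g+er
        for candidate in candidates:
            if candidate in GRADABLE:
                return candidate
    return word


if __name__ == "__main__":
    for word in ("bigger", "larger", "fatter", "cleanest", "winter", "divest"):
        print(word, "->", strip_degree(word))
```

Note that this returns the listed adjective itself (e.g. large), not a Snowball stem, so in practice the result would still need to go through the normal stemmer. The cost is exactly the one described above: the list has to be built and maintained.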

It's true that these comparative and superlative endings are only used on shorter adjectives, and that it seems impossible to implement a rule for them which isn't just a huge list of such adjectives. But there are a number of cases where this "short" overlaps with Snowball's R2, so handling those still seems worth considering even though it only addresses a minority of cases.

Here's an analysis of the changes for the sample vocabulary for removing -est in R2 (adding 'est' where 'er' is handled):

ojwb commented 1 month ago

Another thing I notice from this is that we probably don't want to remove -est if we already removed an ending

Looking at this again, I'm not so convinced that's the right conclusion - we really don't want to remove est from interest (because then it collides with inter), and it's linguistically wrong, though not problematic, to remove it from manifest - if we get that part right the rest works fine. The same is true for undigest, but that's a very rare word so its handling matters rather less.

I looked at the slightly more restricted change of removing est in step 4 unless preceded by er, f, or g (which empirically seems to exclude the problematic cases):
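Read literally, that restriction could be sketched as below. This is an approximation under stated assumptions, not the actual Snowball change: R2 is computed with y always treated as a vowel, the other stemmer steps are ignored, and the names `r2_start`, `strip_est` and `EXCLUDED_BEFORE_EST` are invented for the example.

```python
VOWELS = set("aeiouy")  # assumption: y always counts as a vowel here
EXCLUDED_BEFORE_EST = ("er", "f", "g")


def _region_start(word: str, start: int) -> int:
    """Index just past the first non-vowel following a vowel, scanning
    from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r2_start(word: str) -> int:
    """Start offset of R2 (the R1 rule applied again inside R1)."""
    return _region_start(word, _region_start(word, 0))


def strip_est(word: str, restricted: bool = True) -> str:
    """Remove -est if it lies in R2; when `restricted`, skip words where
    -est is preceded by er, f or g."""
    if not word.endswith("est"):
        return word
    stem = word[:-3]
    if len(stem) < r2_start(word):
        return word  # suffix is not wholly inside R2
    if restricted and stem.endswith(EXCLUDED_BEFORE_EST):
        return word
    return stem


if __name__ == "__main__":
    for word in ("yellowest", "interest", "disinterest", "manifest", "undigest", "cleanest"):
        print(word, "->", strip_est(word, restricted=False), "/", strip_est(word))
```

Without the restriction this sketch turns interest into inter and disinterest into disinter; with it, interest, disinterest, manifest and undigest are left alone while yellowest still becomes yellow. Short superlatives such as cleanest are untouched either way because est is not in R2.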