snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

`english` algorithm says the stem of `added` is `ad` instead of `add` #182

Closed epage closed 1 year ago

epage commented 1 year ago

See line 337 of the test data

or try this with PyStemmer:

>>> import Stemmer
>>> stemmer = Stemmer.Stemmer("english")
>>> stemmer.stemWord("added")
'ad'
>>>

This seems obvious to me, which makes me wonder what I'm missing: I would assume that in 21 years of development this would have been noticed, but I've not found any docs or anything in the issue tracker where this has come up before.

ojwb commented 1 year ago

Bear in mind that the purpose of these stemmers is for use with Information Retrieval ("text search" in less formal terms) and for this what matters is that we map forms of a particular word onto the same output (and especially that we map forms of unrelated words to different outputs). That output tends to look like a word, and it often actually is the linguistic root, but that isn't actually a design goal - if you're wanting a library to reduce words to their linguistic roots then Snowball probably isn't the right answer because it doesn't actually aim to do that (and it seems from the linked ticket comment that this is what you were trying to use Snowball for).

This is arguably a bug, but not because the stem ought to be `add` as such. The problem is that `add` itself is left alone while `ad` is also the stem of `ad` (short for advertisement or advantage, or used in expressions from Latin such as "ad nauseam"), so this conflates words with different meanings.
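To make the conflation point concrete, here is a minimal sketch of the kind of check that matters for retrieval. It does not call a stemmer: the `stems` table just hard-codes the outputs reported in this thread, and the `lexemes` table and both helper names are hypothetical, made up for illustration.

```python
from collections import defaultdict

# Stems as reported in this thread for the `english` algorithm;
# hard-coded so the example does not depend on any stemmer version.
stems = {"add": "add", "added": "ad", "ad": "ad"}

# Hypothetical table saying which word each surface form belongs to.
lexemes = {"add": "add", "added": "add", "ad": "ad"}


def conflated_stems(stems, lexemes):
    """Return stems shared by forms of more than one word."""
    by_stem = defaultdict(set)
    for form, stem in stems.items():
        by_stem[stem].add(lexemes[form])
    return {s: sorted(ws) for s, ws in by_stem.items() if len(ws) > 1}


def split_words(stems, lexemes):
    """Return words whose forms end up under more than one stem."""
    by_word = defaultdict(set)
    for form, stem in stems.items():
        by_word[lexemes[form]].add(stem)
    return {w: sorted(ss) for w, ss in by_word.items() if len(ss) > 1}


print(conflated_stems(stems, lexemes))  # {'ad': ['ad', 'add']}
print(split_words(stems, lexemes))      # {'add': ['ad', 'add']}
```

Both results are non-empty here: forms of "add" are split across two stems, and the stem `ad` mixes two unrelated words.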

There are pretty much inevitably going to be such cases in a stemmer for a human language (and even without stemming languages are messy and ambiguous - e.g. in English ad already has 3 quite different meanings which are conflated in search). Stemming is probably inevitably going to be imperfect, but it's good to approach such problems remembering that the end goal is improving retrieval which an imperfect stemmer can still do.

The underlying cause of `added` -> `ad` is the consonant-undoubling step. The candidates for this are: 'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'

We can't remove 'dd' from that list, as doing so almost always makes things worse: bed/bedding, nod/nodding, pad/padding, and numerous others. The only other case I can see in voc.txt where removing it would help is superadded/superadding, which currently don't stem the same as superadd (but this is a fairly obscure word and lacks the problematic conflation that add -> ad has).
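A rough illustration of that trade-off, as a deliberately simplified sketch: `toy_stem` is a hypothetical stand-in that only strips 'ed'/'ing' and then undoubles, and it omits the region checks and all the other steps of the real `english` algorithm.

```python
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def toy_stem(word, doubles=DOUBLES):
    """Strip 'ed'/'ing', then undouble a trailing double consonant.

    Toy model only: not the real Snowball `english` stemmer.
    """
    for suffix in ("ing", "ed"):
        # Require at least two letters to remain after the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            base = word[: -len(suffix)]
            if base.endswith(doubles):
                base = base[:-1]  # undouble: 'add' -> 'ad', 'bedd' -> 'bed'
            return base
    return word


# The same list with 'dd' taken out, to see what that would cost.
NO_DD = tuple(d for d in DOUBLES if d != "dd")

for w in ["add", "added", "bed", "bedding", "nod", "nodding"]:
    print(w, toy_stem(w), toy_stem(w, NO_DD))
```

With the full list, added gives ad while add stays add; without 'dd', added gives add but bedding gives bedd, which no longer matches bed.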

Looking for other English words which have the form (one or two vowels)(double consonant from the list above)(optional 'ed' or 'ing'), the only other cases in voc.txt seem to be ebb/ebbed/ebbing, err/erred/erring and off/offing.

Casting a wider net, I found eff/effed/effing, egg/egged/egging, and also some very obscure or archaic words.
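A search like this can be sketched with a regular expression. This assumes "vowel" means just a, e, i, o, u (the real algorithm's vowel set may differ), and `short_double_forms` is a hypothetical helper name; the sample list is only a few words mentioned in this thread, not voc.txt itself.

```python
import re

DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")

# (one or two vowels)(double consonant from the list)(optional 'ed' or 'ing')
# Assumption: vowels are a, e, i, o, u only.
PATTERN = re.compile(r"[aeiou]{1,2}(?:%s)(?:ed|ing)?" % "|".join(DOUBLES))


def short_double_forms(words):
    """Return the words that match the whole pattern."""
    return [w for w in words if PATTERN.fullmatch(w)]


sample = ["add", "added", "ebb", "ebbing", "err", "erred", "off",
          "offing", "egg", "bed", "bedding", "superadded", "ad"]
print(short_double_forms(sample))
# ['add', 'added', 'ebb', 'ebbing', 'err', 'erred', 'off', 'offing', 'egg']
```

To scan a real word list, the same function could be fed the stripped lines of a local file, assuming one word per line.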

We could potentially do something like not undoubling if there's only one character before the double consonant. That would help all the (one or two vowels)(double consonant from the list above)(optional 'ed' or 'ing') cases noted above, but it makes ab/abbed (climber slang for abseiled) and up/upped/upping worse (currently these groups stem to ab and up respectively). Restricting it to cases where that one character is a, e or o seems to reduce the cases made worse to just ab/abbed, which is probably a reasonable trade-off.
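The proposed restriction can be sketched on a toy model. This is not the Snowball implementation: `toy_stem` and its `keep_after` parameter are hypothetical, and the function only strips 'ed'/'ing' and undoubles, ignoring every other step of the real algorithm.

```python
import string

DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def toy_stem(word, keep_after=""):
    """Strip 'ed'/'ing', then undouble unless the exception applies.

    keep_after: letters which block undoubling when they are the single
    character before the double consonant ("" means always undouble).
    Toy model only: not the real Snowball `english` stemmer.
    """
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            base = word[: -len(suffix)]
            if base.endswith(DOUBLES):
                # Exactly one character before the double consonant?
                lone_prefix = len(base) == 3 and base[0] in keep_after
                if not lone_prefix:
                    base = base[:-1]
            return base
    return word


ANY = string.ascii_lowercase  # "any single character" variant
for w in ["added", "ebbing", "erred", "upped", "abbed", "bedding"]:
    print(w, toy_stem(w), toy_stem(w, ANY), toy_stem(w, "aeo"))
```

In this toy model the "aeo" variant keeps add, ebb and err intact, still maps upped to up, and leaves abbed as abb, which is the one group described above as made worse.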

ojwb commented 1 year ago

Implementing the "aeo" rule above causes these changes for the sample vocabulary which look good:

compare

(I assume "erly" and "offe" are either typos or from some non-English or maybe Middle English text included in the source text this wordlist was generated from.)