Closed epage closed 1 year ago
Bear in mind that the purpose of these stemmers is for use with Information Retrieval ("text search" in less formal terms) and for this what matters is that we map forms of a particular word onto the same output (and especially that we map forms of unrelated words to different outputs). That output tends to look like a word, and it often actually is the linguistic root, but that isn't actually a design goal - if you're wanting a library to reduce words to their linguistic roots then Snowball probably isn't the right answer because it doesn't actually aim to do that (and it seems from the linked ticket comment that this is what you were trying to use Snowball for).
This is arguably a bug, but the reason why is not that the stem should be 'add' as such, but rather that 'add' itself is left alone and that 'ad' is also the stem of 'ad' (short for 'advertisement' or 'advantage', or in expressions from Latin such as "ad nauseam") so this conflates words with different meanings.
There are pretty much inevitably going to be such cases in a stemmer for a human language (and even without stemming, languages are messy and ambiguous - e.g. in English 'ad' already has 3 quite different meanings which are conflated in search). Stemming is probably inevitably going to be imperfect, but it's good to approach such problems remembering that the end goal is improving retrieval, which an imperfect stemmer can still do.
The underlying cause of added -> ad is the undoubling of consonants step - the candidates for this are: 'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
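To make the mechanism concrete, here is a minimal Python sketch of suffix stripping followed by undoubling. It is a deliberately simplified illustration, not the Snowball English stemmer itself: the name `toy_stem` and the bare "remaining part contains a vowel" check are assumptions made for this example, and the real algorithm applies further conditions.

```python
# Double consonants that are candidates for undoubling (list taken from the text above).
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")
VOWELS = set("aeiouy")


def toy_stem(word):
    """Strip a trailing 'ing' or 'ed', then undouble a trailing double consonant.

    Simplified for illustration only; this is not the Snowball algorithm.
    """
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only strip the suffix if what remains contains a vowel.
            if any(ch in VOWELS for ch in stem):
                if stem.endswith(DOUBLES):
                    return stem[:-1]  # undouble: 'add' -> 'ad', 'bedd' -> 'bed'
                return stem
    return word


print(toy_stem("bedding"))  # bed  (matches 'bed', which is what we want)
print(toy_stem("added"))    # ad   (but 'add' itself is never undoubled)
print(toy_stem("add"))      # add
```

The last two lines show the mismatch: the inflected form loses a 'd' while the base form keeps it, so the two no longer share a stem.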
We can't remove 'dd' from that list as doing so almost always makes things worse: bed/bedding, nod/nodding, pad/padding, and numerous others. The only other case I can see in voc.txt where removing it seems to help is superadded/superadding, which currently don't stem the same as superadd (but this is a fairly obscure word and lacks the problematic conflation that add -> ad has).
Looking for other English words which have the form (one or two vowels)(double consonant from the list above)(optional 'ed' or 'ing'), the only other cases in voc.txt seem to be ebb/ebbed/ebbing, err/erred/erring and off/offing.
Casting a wider net, I found eff/effed/effing, egg/egged/egging, and also some very obscure or archaic words.
We could potentially do something like not undouble if there's only one character before the double consonant. That'd help all the (one or two vowels)(double consonant from the list above)(optional 'ed' or 'ing') cases noted above, but it makes ab/abbed (climber slang for abseiled) and up/upped/upping worse (currently these groups stem to 'ab' and 'up' respectively). Restricting it to that one character being 'a', 'e' or 'o' seems to reduce the cases made worse to just ab/abbed, which is probably a reasonable trade-off.
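As a rough sketch of the proposed restriction, the undoubling step could skip three-letter stems that start with 'a', 'e' or 'o'. The function name `undouble` and the `aeo_rule` flag are hypothetical names for this example, and the function operates on an already suffix-stripped stem; it is not the actual Snowball change.

```python
# Double consonants that are candidates for undoubling.
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def undouble(stem, aeo_rule=True):
    """Undouble a trailing double consonant, optionally applying the 'aeo' exception.

    With aeo_rule=True, a stem that is a single 'a', 'e' or 'o' followed by
    the double consonant is left alone.
    """
    if not stem.endswith(DOUBLES):
        return stem
    if aeo_rule and len(stem) == 3 and stem[0] in "aeo":
        return stem
    return stem[:-1]


for stem in ("add", "ebb", "off", "upp", "abb", "bedd"):
    print(stem, "->", undouble(stem, aeo_rule=False), "vs", undouble(stem))
# add -> ad vs add
# ebb -> eb vs ebb
# off -> of vs off
# upp -> up vs up
# abb -> ab vs abb
# bedd -> bed vs bed
```

The 'abb' row is the one regression mentioned above: abbed would no longer share a stem with ab, while up/upped and longer words such as bedding are unaffected.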
Implementing the "aeo" rule above causes these changes for the sample vocabulary which look good:
(I assume "erly" and "offe" are either typos or from some non-English or maybe Middle English text included in the source text this wordlist was generated from.)
See line 337 of the test data or try this with PyStemmer:
This seems obvious to me, which makes me wonder what I'm missing: I would assume that in 21 years of development this would have been noticed, but I've not found any docs or anything in the issue tracker where this has come up before.