Closed: abratashov closed this 1 year ago
@ojwb could you help me find the author of the Russian algorithm? It was added by Richard Boulton,
but I'm not sure that he implemented it. Thanks!
It's by Martin Porter (all the original Snowball algorithms were). If you have questions about the algorithm design or implementation, he reads the mailing list so best to ask there.
If you've not seen it already, the algorithm is described here: https://snowballstem.org/algorithms/russian/stemmer.html
That mostly just describes what it does, and doesn't say much about the reasons why design decisions were made.
Martin Porter answered:
The Russian stemmer was a collaboration of myself with Patrick Miles, a professional Russian translator.
The critical point is the note in the algorithm description,
"a tempting way of running the stemmer is to set a minimum stem length of zero, and thereby reduce to null all words which are made up entirely of suffix parts. We have been a little more cautious, and have insisted that a minimum stem contains one vowel."
and then, "RV is the region after the first vowel" and then "all tests take place in the RV part of the word".
In всплыла, в, с, п, л are consonants and ы is the first vowel, so RV contains just "ла". In the RV region we therefore find the ending ла, but not ыла.
The algorithm makes an exception of вспл (surface) because it contains no vowel.
Perhaps the algorithm is in error here, but that is the reason for the result.
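The RV argument above can be checked with a minimal Python sketch. It is not the Snowball implementation: the helper names `rv_region` and `ending_fits_in_rv` are hypothetical, the vowel set is the one listed in the published algorithm description, and the check ignores the stemmer's ordered steps and ending classes. It only asks whether a candidate ending lies entirely inside RV.

```python
# Vowels as listed in the Snowball Russian stemmer description.
RUSSIAN_VOWELS = set("аеиоуыэюя")


def rv_region(word: str) -> str:
    """Return RV: the part of the word after its first vowel.

    If the word has no vowel, RV is empty (the end of the word).
    """
    for i, ch in enumerate(word):
        if ch in RUSSIAN_VOWELS:
            return word[i + 1:]
    return ""


def ending_fits_in_rv(word: str, ending: str) -> bool:
    """True if `ending` is a suffix of the word AND lies wholly inside RV.

    Simplification (assumption): the real stemmer also applies
    per-class conditions, e.g. some endings must follow а or я.
    """
    return rv_region(word).endswith(ending)


word = "всплыла"
print(rv_region(word))                 # ла
print(ending_fits_in_rv(word, "ыла"))  # False: ыла starts before RV
print(ending_fits_in_rv(word, "а"))    # True: а lies inside RV
```

Under these assumptions the verb ending ыла cannot match because it extends past the start of RV, while the single-letter ending а does fit. Removing а gives всплыл, which is consistent with the output reported in this issue.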
@ojwb thanks!
I'm sorry if this is the wrong place for such a question; I'll move it to the appropriate place if needed. I'm currently trying to understand how the Russian algorithm works, and I've created a separate branch and a question on it: https://github.com/abratashov/snowball/commit/3ff707d2da4104c144c6cd6eae2cb2c84b8f30db#r90635445
I'm exploring how Snowball works in order to implement (or improve the existing) Ukrainian algorithm. I've found that the word
всплыла
is stemmed by Snowball into
всплыл
but as I see it, it should be split as a verb:
вспл|ыла
That ending is defined on line 150: https://github.com/abratashov/snowball/commit/3ff707d2da4104c144c6cd6eae2cb2c84b8f30db#diff-b960355da02c4da266b33015a78cea13802857c245e6d6ac959047efb6b44fbdR150
Also, I've implemented word-coverage testing for the Russian algorithm that can be run:
Everything is in this branch: https://github.com/abratashov/snowball/compare/Add-test-coverage-for-russian-language
Thanks!