snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Understanding Snowball: why is the word всплыла not recognized as a verb in the Russian algorithm? #173

Closed abratashov closed 1 year ago

abratashov commented 1 year ago

I'm sorry if this is the wrong place for such a question; I'll move it to the appropriate place if needed. I'm currently trying to understand how the Russian algorithm works, and I've created a separate branch with a question on it: https://github.com/abratashov/snowball/commit/3ff707d2da4104c144c6cd6eae2cb2c84b8f30db#r90635445

I'm exploring how Snowball works in order to implement (or improve the existing) Ukrainian algorithm. I've found that Snowball stems the word всплыла to всплыл:

echo "всплыла" | ./stemwords -l ru
=>
всплыл

but as far as I can see, it should be split as a verb, вспл|ыла, as defined on line 150: https://github.com/abratashov/snowball/commit/3ff707d2da4104c144c6cd6eae2cb2c84b8f30db#diff-b960355da02c4da266b33015a78cea13802857c245e6d6ac959047efb6b44fbdR150

[Step 1]
remove(perfective_gerund)                   # всплыла ? -> No
OR
  remove!(reflexive)                        # всплыла ? -> No
  remove([adjective, adjectival])           # всплыла ? -> No
  OR
  remove(verb)                              # вспл|ыла ? -> Yes -> вспл !!But Snowball skips it!! Why?!

Also, I've implemented a word-coverage test for the Russian algorithm, which can be run with:

ruby ./tests/algorithms/russian_test.rb

Everything is in this branch: https://github.com/abratashov/snowball/compare/Add-test-coverage-for-russian-language

Thanks!

abratashov commented 1 year ago

@ojwb could you help me find the author of the Russian algorithm? It was added by Richard Boulton, but I'm not sure he implemented it. Thanks!

ojwb commented 1 year ago

It's by Martin Porter (all the original Snowball algorithms were). If you have questions about the algorithm design or implementation, he reads the mailing list so best to ask there.

If you've not seen it already, the algorithm is described here: https://snowballstem.org/algorithms/russian/stemmer.html

That mostly just describes what it does, and doesn't say much about the reasons why design decisions were made.

abratashov commented 1 year ago

Martin Porter answered:

The Russian stemmer was a collaboration of myself with Patrick Miles, a professional Russian translator.

The critical point is the note in the algorithm description,

"a tempting way of running the stemmer is to set a minimum stem length of zero, and thereby reduce to null all words which are made up entirely of suffix parts. We have been a little more cautious, and have insisted that a minimum stem contains one vowel."

and then, "RV is the region after the first vowel", and then "all tests take place in the RV part of the word".

In всплыла, в, с, п, л are consonants and ы is the first vowel, so RV contains just "ла". Within RV we therefore find the ending ла, but not ыла.

The algorithm makes an exception of вспл (surface) because it contains no vowel.

Perhaps the algorithm is in error here, but that is the reason for the result.
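
To make the RV explanation concrete, here is a minimal Python sketch of the region check only. It is not the Snowball implementation: the helper names `rv_region` and `ending_in_rv` are hypothetical, and the vowel list is an assumption taken from the published algorithm description (а е и о у ы э ю я). It shows that ыла does not fit inside RV for всплыла, while ла does.

```python
# Vowels as listed in the Russian algorithm description (assumed here).
VOWELS = set("аеиоуыэюя")


def rv_region(word: str) -> str:
    """Return the part of the word after its first vowel ('' if there is none)."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i + 1:]
    return ""


def ending_in_rv(word: str, ending: str) -> bool:
    """True if the word ends with `ending` and the ending lies entirely inside RV."""
    return word.endswith(ending) and len(ending) <= len(rv_region(word))


word = "всплыла"
print(rv_region(word))             # ла
print(ending_in_rv(word, "ыла"))   # False: ы is the first vowel, so it is outside RV
print(ending_in_rv(word, "ла"))    # True
```

Because every suffix test is confined to RV, the three-letter ending ыла is never a candidate for this word, which is consistent with the stem всплыл reported above rather than вспл.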

abratashov commented 1 year ago

@ojwb thanks!