snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Add ukrainian stemmer #178

Open abratashov opened 1 year ago

ojwb commented 11 months ago

@abratashov Have you finished working on this? It seems additional changes get pushed from time to time, and I can still see commented out code and questions in comments in the .sbl file...

Also, can you clarify how this relates to the Ukrainian stemmer in #144?

It seems they've been separately developed, both starting from the Snowball Russian stemming algorithm.

The original author of the code in #144 made some comments about it - notably that it doesn't try to remove prefixes (as best I can tell yours doesn't either?), and uses a cruder length check than the usual Snowball R1/R2/RV approach which the Russian stemmer and yours use.

Comparing output on the sample vocabulary from https://github.com/snowballstem/snowball-data/pull/18 I can see quite a few cases which the older submission appears to handle better (I can't read Ukrainian though, so maybe these are incorrect conflation of similar words with different meanings), e.g. here's an annotated screenshot with your stemmer on the right:

[Screenshot: ukrainian stemmer output comparison]

I've marked in green vs red where it looks to me like one stemmer is doing a better job.

In this screenful there's one word where yours seems better, but the other stemmer seems better overall. This varies as I page through the file, but if I had to pick, the stemmer from #144 seems a bit better. However, I should reiterate that's an impression I've formed without any knowledge of what the words I'm looking at actually mean!

One likely flaw I spotted with the other stemmer is that it can reduce words to a single letter. That is not necessarily always wrong, but it is liable to conflate unrelated words, given there are only 33 possible single-letter stems. I suspect that's a result of using an initial length check instead of restricting removal to suffixes in R1/R2.
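For readers unfamiliar with the R1/R2 regions mentioned above, here is a minimal Python sketch of the usual Snowball definition: R1 starts after the first non-vowel that follows a vowel, and R2 is found the same way inside R1. The vowel set and all function names are assumptions made for illustration; they are not taken from either submission.

```python
# Assumed vowel set, for illustration only; each Snowball algorithm
# defines its own vowel grouping in its .sbl source.
UKRAINIAN_VOWELS = frozenset("аеєиіїоуюя")


def region_start(word, vowels, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) (an empty region) if none."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in vowels and word[i] not in vowels:
            return i + 1
    return len(word)


def r1_r2(word, vowels=UKRAINIAN_VOWELS):
    """Return the start offsets of R1 and R2 within `word`."""
    r1 = region_start(word, vowels)
    r2 = region_start(word, vowels, r1)
    return r1, r2


def strip_suffix_in_region(word, suffix, region):
    """Remove `suffix` only if it lies entirely inside word[region:]."""
    if word.endswith(suffix) and len(word) - len(suffix) >= region:
        return word[: len(word) - len(suffix)]
    return word
```

With this restriction a two-letter word such as "та" has an empty R1, so no suffix is removed and it cannot collapse to a single letter, whereas a plain length check on the whole word gives no such guarantee.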

abratashov commented 11 months ago

@ojwb thanks for your checks on this PR, yes I'm polishing it!

With the help of other people from Ukraine and the international community, this year I've dived deeper into the Snowball stemmer and this area in general.

Currently, this PR contains the latest version of the UA stemmer and some dev tools that facilitate development (a utf <=> sbl converter), as well as some files with test words.

In the near future, I'll explore the stemmer in https://github.com/snowballstem/snowball/pull/144. As far as I know, that PR was opened by @tggo, who (if I'm not wrong; I couldn't contact him) just took the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll soon make use of all its advantages too.

Main questions:

1. What should the PR look like? Should it be only the one ukrainian.sbl file?
2. How can I estimate the quality of the stemmer? Are there any tools for that? CC: @arysin , @amakukha
3. Where should I keep test sets of words (.txt, .yml etc)? Because I can't find any test cases in the original Snowball repository.

Thanks!

Tapkomet commented 11 months ago

> @ojwb thanks for your checks on this PR, yes I'm polishing it!
>
> Main questions:
>
> 1. What should the PR look like? Should it be only the one `ukrainian.sbl` file?
> 2. How can I estimate the quality of the stemmer? Are there any tools for that?
> 3. Where should I keep test sets of words (*.txt, *.yml etc)? Because I can't find any test cases in the original Snowball repository.
>
> Thanks!

I believe I can help a bit with questions 2 and 3. When I worked on this, I built a Java project - I believe there are instructions on how to do it on the Snowball website. IIRC I had to rebuild it whenever I made edits to the .sbl file. (I should note that the project would come out slightly wrong, with incorrectly set imports, but when I fixed that it would be workable).

Afterwards, I simply had a text file in the project folder with a bunch of Ukrainian text (I copy-pasted a bunch of Ukrainian Wikipedia articles into the file as source material), and the program would output the results to a results text file.

For measuring output of the stemmer, I would simply go through a significant amount of results at random (like a hundred or two) and tally up the number of errors. Obviously I had to judge by myself what was an error and what wasn't, so it was subjective in some cases.

If you want to see examples, I am attaching the txt file containing source text, and the results file. The results file pairs each stemmed word with its original form (first stemmed, then original), e.g. авторств авторство

testUkrainian.txt Results.txt
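The manual spot check described above could be sketched in Python roughly as follows. It assumes the results file format given in the comment (one "stem original" pair per line); the function names and the idea of recording each judgement as a boolean are illustrative assumptions, not part of the attached project.

```python
import random


def parse_results(lines):
    """Parse 'stem original' lines into (stem, original) pairs,
    skipping blank or malformed lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def sample_pairs(pairs, k, seed=None):
    """Pick up to k pairs at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def error_rate(judgements):
    """Fraction of reviewed pairs judged to be errors.
    `judgements` holds one bool per reviewed pair (True = error)."""
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)


# Example use (the file name is the attachment mentioned above):
#   with open("Results.txt", encoding="utf-8") as f:
#       to_review = sample_pairs(parse_results(f), 200)
```

The judgement itself remains manual and subjective, as noted; the sketch only makes the sampling and the tally repeatable.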

ojwb commented 11 months ago

> In the near future, I'll explore the stemmer in #144. As far as I know, that PR was opened by @tggo, who (if I'm not wrong; I couldn't contact him) just took the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll soon make use of all its advantages too.

#144 is the "UAStemming" code with one change: it uses the newer {U+nnnn} notation for Unicode codepoints instead of hex nnnn. The way hex is specified means you need a modified version of the Snowball source to support single byte character sets, whereas the newer syntax allows us to have a single version of the source of each algorithm. I don't know if KOI8-U is still relevant, but if it were, it would help for that.
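As an illustration of the {U+nnnn} notation, a small Python sketch for converting between literal characters and that escape form might look like this. The function names are hypothetical and this is not the converter included in the PR; it only shows the codepoint mapping.

```python
import re


def to_snowball_escape(text):
    """Write each character as a {U+XXXX} escape (4+ hex digits)."""
    return "".join("{U+%04X}" % ord(ch) for ch in text)


def from_snowball_escape(text):
    """Replace every {U+XXXX} escape with the character it names."""
    return re.sub(
        r"\{U\+([0-9A-Fa-f]+)\}",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```

For example, the Cyrillic letter "а" (U+0430) becomes `{U+0430}`, and converting back restores the original character.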

> Main questions:
>
> 1. What should the PR look like? Should it be only the one `ukrainian.sbl` file?

This is detailed in CONTRIBUTING.rst, but essentially just the new file and an update to modules.txt. Everything should automatically work from that.

Test coverage is provided via the data files in snowball-data (which make check, make check_java etc in snowball will use automatically), which are in a separate repo as they're much larger than the code itself. These provide test coverage for all languages Snowball can generate code for, so they are a better approach than writing test scripts in a particular language, which would need writing 9 times, with any update applied in 9 places.

Please keep each PR to one purpose - make dev tools, etc their own PR(s). Reviewing a larger PR is harder and takes longer, and everything ends up blocked by a blocker in one part.

> 2. How can I estimate the quality of the stemmer? Are there any tools for that? CC: @arysin , @amakukha

Looking at the output of ./stemtest -l ukrainian -p2 < some-ukrainian-word-list.txt gives an idea (the screenshot above is just that output for the two stemmers compared in vimdiff). We don't have anything more sophisticated.

I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.
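A side-by-side comparison like the vimdiff one can also be approximated with a few lines of Python. This sketch assumes you have saved each stemmer's output as one stem per line, in the same order as the word list; that file layout, the function names, and the stems in the example are assumptions for illustration.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def differing_stems(words, stems_a, stems_b):
    """Return (word, stem_a, stem_b) for every word on which the two
    stemmers disagree. All three lists must be parallel."""
    if not (len(words) == len(stems_a) == len(stems_b)):
        raise ValueError("word list and stem lists must have equal length")
    return [
        (w, a, b)
        for w, a, b in zip(words, stems_a, stems_b)
        if a != b
    ]


# Hypothetical stems, purely to show the output shape:
example = differing_stems(
    ["авторство", "автор"],
    ["авторств", "автор"],
    ["авторст", "автор"],
)
```

Here `example` contains only the first word, since both stemmers agree on the second. Deciding which stem is better still needs someone who can read the language.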

> 3. Where should I keep test sets of words (*.txt, *.yml etc)? Because I can't find any test cases in the original Snowball repository.

snowball-data (again, read CONTRIBUTING.rst).

There's a wordlist extracted from Ukrainian wikipedia in https://github.com/snowballstem/snowball-data/pull/22 (I think the submitter closed it after realising the algorithm had already been submitted, but the earlier submission had a wordlist that seems much too short so I'd suggest this one unless you have a better one which is suitably licensed).

abratashov commented 11 months ago

Now everything is clear, thanks for the answers, will do it!

ojwb commented 11 months ago

> I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.

This is now in the snowball-data repo as scripts/stemmer-compare - you might find it useful for evaluating potential changes you're considering making to the algorithm.

It takes a vocabulary list and two output files with stemmed versions and attempts to describe the changes. It can spot and describe some simple cases of merged or split groups of stems, and some cases where a stem moves between groups. Testing so far suggests it does better than I'd hoped for evaluating small tweaks to an algorithm, but it does less well for comparing "porter" vs "english" (where the latter evolved from the former) and isn't really useful for "dutch" vs "kraaij_pohlmann" (which are two separately developed Dutch stemming algorithms). It'll likely improve with time.

Sample excerpts of output for a recent tweak to the swedish stemmer:

A total of 342 words changed stem

* 273 words changed stem but aren't interesting:
  altröst, amitiöst, anderöster, andraröster, [...]

* 53 merges of groups of stems:
  { ambitiöst } + { ambitiös, ambitiösa, ambitiösare, ambitiösaste, ambitiöse }
  { amoröst } + { amorös, amorösa, amoröse }
  { avlöst, avlösta, avlöste, avlöstes, avlösts } + { avlösa, avlösande, avlösare, avlösas, avlöser, avlöses }
[...]
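
To make the "merges of groups of stems" idea concrete, here is a simplified Python sketch that detects when two or more complete old groups end up sharing one stem after a change. It is not the actual scripts/stemmer-compare implementation; the function names are hypothetical, and the stems in the usage note are invented for illustration.

```python
from collections import defaultdict


def group_by_stem(words, stems):
    """Map each stem to the set of words that produce it."""
    groups = defaultdict(set)
    for word, stem in zip(words, stems):
        groups[stem].add(word)
    return groups


def find_merges(words, old_stems, new_stems):
    """Return merges as lists of old groups (each a sorted word list).

    A merge is a new group that is exactly the union of two or more
    complete old groups."""
    old_stem_of = dict(zip(words, old_stems))
    old_groups = group_by_stem(words, old_stems)
    merges = []
    for members in group_by_stem(words, new_stems).values():
        old_keys = {old_stem_of[w] for w in members}
        if len(old_keys) < 2:
            continue
        union = set().union(*(old_groups[k] for k in old_keys))
        if union == members:
            merges.append(sorted(sorted(old_groups[k]) for k in old_keys))
    return merges
```

For instance, if "ambitiöst" previously stemmed to itself while "ambitiös" and "ambitiösa" shared a stem, and after the change all three share one stem, `find_merges` reports the two old groups as one merge. Splits and words moving between groups would need additional cases.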

abratashov commented 9 months ago

@ojwb I've updated the current stemmer with new rules, and also opened a PR with test words: https://github.com/snowballstem/snowball-data/pull/24

I hope that over the next month I'll polish it into a production-ready release!