morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Autocorrection for stemming #7

Closed Rzulf closed 11 years ago

Rzulf commented 11 years ago

I used the new (1.7.0) polish-stemmer package from Maven and noticed that it doesn't fix diacritics, even though these options are true by default.

Here are some simple unit tests I made: http://pastebin.com/jwHSVecU

Another question: why isn't "ą" replaced by "a" by default, as is done for "Ł" and "L"?

milekpl commented 11 years ago

The polish-stemmer package does not fix diacritics because it is not supposed to; that is the speller's job. You could use the stemmer dictionary for spelling, but this is discouraged: stemmer dictionaries usually contain more words than speller dictionaries, and some rare words should not be accepted when they are easily confused with more common ones (e.g., Polish 'sie' is confused with the much more frequent 'się').

Rzulf commented 11 years ago

Thanks for the reply :) PS. Transforming 'sie' to 'się' is exactly what I want to do, since 'się' is much more frequent.

milekpl commented 11 years ago

But to do this, you need a speller, not a stemmer. You could simply run a Speller on the words to be stemmed and take the first suggestion. The Polish spelling dictionary to use is available in the LanguageTool repository: http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/languagetool/languagetool-language-modules/pl/src/main/resources/org/languagetool/resource/pl/hunspell/ (take the .dict and .info files).
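
The "speller first, then stemmer" idea above can be sketched roughly as follows. This is only an illustration, not the library's API: `SpellThenStem`, `correct` and `correctThenStem` are hypothetical names, and the speller and stemmer are passed in as plain functions so the snippet runs without any dictionary files. In a real setup the suggester would presumably wrap the morfologik Speller loaded with the .dict/.info files mentioned above (for example via its `findReplacements` method, depending on the library version), and the stemmer would wrap the Polish stemmer.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SpellThenStem {

    /**
     * Returns the first suggestion for the word, or the word unchanged
     * when the suggester offers nothing (assumed to mean "spelled correctly").
     */
    static String correct(String word, Function<String, List<String>> suggester) {
        List<String> suggestions = suggester.apply(word);
        if (suggestions == null || suggestions.isEmpty()) {
            return word;
        }
        return suggestions.get(0);
    }

    /** Corrects the word first, then passes the corrected form to the stemmer. */
    static String correctThenStem(String word,
                                  Function<String, List<String>> suggester,
                                  Function<String, String> stemmer) {
        return stemmer.apply(correct(word, suggester));
    }

    public static void main(String[] args) {
        // Toy stand-in for a real speller: a fixed table of suggestions.
        Map<String, List<String>> toySuggestions = Map.of("sie", List.of("się"));
        Function<String, List<String>> suggester =
                w -> toySuggestions.getOrDefault(w, List.of());

        // Toy stand-in for a real stemmer: returns the word unchanged.
        Function<String, String> stemmer = w -> w;

        System.out.println(correctThenStem("sie", suggester, stemmer));
    }
}
```

Keeping the two steps behind separate functions means the stemmer dictionary never has to double as a spelling dictionary, which is the setup discouraged earlier in the thread.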