morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Is it possible to create dictionaries for other languages #2

Closed redguy666 closed 11 years ago

redguy666 commented 11 years ago

Hi,

I have noticed that Czech and Slovak have quite poor stemming support in Solr: only some basic heuristics and Hunspell, which is very slow in Solr 4.x. Would it be possible to prepare dictionaries for those languages, similar to the Polish one, based for example on the OpenOffice dictionaries? If so, how can that be done?

dweiss commented 11 years ago

I'm sure it'd be possible to prepare such dictionaries. Marcin Miłkowski may have more insight into this. I'll let him know.

dweiss commented 11 years ago

Hi. Marcin says those dictionaries are in fact available as part of the LanguageTool project -- perhaps you can just take a look in there and reuse them?

http://www.languagetool.org/

redguy666 commented 11 years ago

I thought LanguageTool was a grammar checker rather than a stemming library. Also, there is no Czech support, only Slovak (http://www.languagetool.org/languages/). Could you provide more information on how to use LanguageTool for stemming?

dweiss commented 11 years ago

Check out the source code -- there are FSA dictionaries for multiple languages (including Czech in one of the older versions I think). Marcin Miłkowski will know more details.

ragerri commented 11 years ago

Hi,

To create a dictionary using Morfologik, you will need to look at this:

http://wiki.languagetool.org/developing-a-tagger-dictionary

To see how they use Morfologik stemming, you will need to look at the LT (LanguageTool) code itself. They use Morfologik's DictionaryLookup and IStemmer classes.
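
For readers unfamiliar with what a dictionary lookup of this kind does, here is a minimal sketch in plain Java. It is not the Morfologik API: `ToyDictionaryStemmer`, `Analysis`, the tab-separated column order (inflected form, base form, tag), and the sample tags in the comments are all assumptions made for illustration, and a hash map stands in for the compiled FSA. It only shows the idea of mapping an inflected form to its base forms and tags.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for a dictionary-based stemmer (NOT the Morfologik API). */
class ToyDictionaryStemmer {

    /** One analysis of a surface form: its base form (stem) and a morphological tag. */
    static final class Analysis {
        final String stem;
        final String tag;

        Analysis(String stem, String tag) {
            this.stem = stem;
            this.tag = tag;
        }
    }

    private final Map<String, List<Analysis>> entries = new HashMap<>();

    /**
     * Builds the lookup table from lines of the ASSUMED form
     * "inflected<TAB>base<TAB>tag", e.g. "psy<TAB>pies<TAB>subst:pl:nom"
     * (the tag string here is made up for the example).
     */
    ToyDictionaryStemmer(List<String> lines) {
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            String tag = cols.length > 2 ? cols[2] : "";
            entries.computeIfAbsent(cols[0], k -> new ArrayList<>())
                   .add(new Analysis(cols[1], tag));
        }
    }

    /** Returns every analysis for the word, or an empty list if the word is unknown. */
    List<Analysis> lookup(String word) {
        return entries.getOrDefault(word, Collections.emptyList());
    }
}
```

A word that is not in the table yields an empty list rather than a guessed stem, which is the main practical difference between this style of stemming and heuristic suffix stripping.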

You can also ask in the LT lists.

Cheers,

Rodrigo

On Fri, May 17, 2013 at 10:56 AM, Maciej Lizewski notifications@github.com wrote:

I thought LanguageTool is rather a grammar checker, not stemming library... also - there is no Czech support, only Slovak (http://www.languagetool.org/languages/). Could you provide more information on how to use LanguageTool for stemming?


redguy666 commented 11 years ago

Thanks for your hints. I will try that out.

milekpl commented 11 years ago

@redguy666: there is a Czech dictionary, although the support for Czech is not advertised. The reason is that we only have a dictionary, and a big one at that. I'm not sure where the file is in our Maven repo right now, but here's the old location:

http://svn.code.sf.net/p/languagetool/code/tags/V_1_9/src/main/resources/org/languagetool/resource/cs/

redguy666 commented 11 years ago

It worked :) at least for the Slovak language (there is no Czech dictionary in LanguageTool). I created a universal MorfologikStemmer filter for Solr: it accepts a dictionary name as a parameter instead of a DICTIONARY enumeration element, so you can use it with any dictionary from LanguageTool.
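
The idea of selecting a dictionary by name rather than through a fixed enumeration could look roughly like the sketch below. The filter's actual code and parameter name are not shown in this thread, so `DictionaryFiles` and `forName` are hypothetical, and the file names in the comments are only examples. The one thing taken from Morfologik is that a dictionary ships as a `.dict` automaton file plus a matching `.info` metadata file.

```java
/** Hypothetical helper: derives the resource names for a dictionary chosen by name. */
class DictionaryFiles {
    final String fsaResource;      // compiled automaton, e.g. "slovak.dict"
    final String metadataResource; // matching metadata, e.g. "slovak.info"

    private DictionaryFiles(String base) {
        this.fsaResource = base + ".dict";
        this.metadataResource = base + ".info";
    }

    /** Accepts "slovak" or "slovak.dict"; rejects null or empty names. */
    static DictionaryFiles forName(String name) {
        if (name == null) {
            throw new IllegalArgumentException("dictionary name must not be null");
        }
        String base = name.trim();
        if (base.endsWith(".dict")) {
            base = base.substring(0, base.length() - ".dict".length());
        }
        if (base.isEmpty()) {
            throw new IllegalArgumentException("dictionary name must not be empty");
        }
        return new DictionaryFiles(base);
    }
}
```

Taking a free-form name means a new dictionary can be dropped in without recompiling the filter, at the cost of losing the compile-time check that an enumeration gives.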

redguy666 commented 11 years ago

@milekpl - sorry, I missed your last comment. Thanks for the link to the Czech dictionary!