morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Get all word varieties by word base #87

Closed: gaffkins closed this issue 7 years ago

gaffkins commented 7 years ago

Can I get all word varieties by word base?

dweiss commented 7 years ago

For Polish, yes. This is the main purpose of the morfologik-polish subproject.

gaffkins commented 7 years ago

With what method? Because lookup returns only the base word. I need all varieties for a given base word. For example, I enter pies and I expect psy, psu, psem, psie...

dweiss commented 7 years ago

The short answer: the same method, but a different dictionary. https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-stemming/src/test/java/morfologik/stemming/DictionaryLookupTest.java#L164-L176

Morfologik doesn't ship with a dictionary for synthesis -- you'll have to invert the tagging dictionary or get the polish_synth dictionary from LanguageTool. See polish.README.Polish.txt
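
To illustrate what "inverting the tagging dictionary" could mean, here is a minimal sketch in plain Java. It does not use Morfologik's classes or its on-disk dictionary format. The Entry class, the invert method and the "base|tag" key format are assumptions made for this illustration; the sample forms and tags are the ones quoted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertTagging {
    /** Hypothetical tagging entry: an inflected form with its base form and tag. */
    public static final class Entry {
        public final String inflected;
        public final String base;
        public final String tag;

        public Entry(String inflected, String base, String tag) {
            this.inflected = inflected;
            this.base = base;
            this.tag = tag;
        }
    }

    /**
     * Turns "inflected -> (base, tag)" entries into "base|tag -> inflected forms".
     * The "base|tag" key layout is an assumption for this sketch.
     */
    public static Map<String, List<String>> invert(List<Entry> tagging) {
        Map<String, List<String>> synth = new LinkedHashMap<>();
        for (Entry e : tagging) {
            String key = e.base + "|" + e.tag;
            synth.computeIfAbsent(key, k -> new ArrayList<>()).add(e.inflected);
        }
        return synth;
    }

    public static void main(String[] args) {
        List<Entry> tagging = Arrays.asList(
            new Entry("niemiecku", "niemiecki", "adjp"),
            new Entry("niemiecko", "niemiecki", "adja"));
        // prints {niemiecki|adjp=[niemiecku], niemiecki|adja=[niemiecko]}
        System.out.println(invert(tagging));
    }
}
```

A real synthesis dictionary would then be compiled from the inverted entries with the dictionary tooling; that step is not shown here.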

lukskar commented 7 years ago

Hey, this question is also relevant for me. At the moment I'm using the polish_synth dictionary from LanguageTool and the IStemmer.lookup method like this: iStemmer.lookup("<word>|<tag>"). For example, iStemmer.lookup("niemiecki|adjp") returns "niemiecku", and passing "adja" as the tag returns "niemiecko". Is there a way to retrieve the list of all possible varieties with a single call to lookup?

dweiss commented 7 years ago

You can look up a node corresponding to "niemiecki|" in the automaton and collect all the leaves starting from there. There are utilities to do this in a pretty simple way -- look at unit tests and grep the code, please.
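
The idea of walking to the node for "niemiecki|" and collecting everything below it can be sketched with a small in-memory trie. This is only an illustration of the technique, not Morfologik's API: the Node class, the insert and collect methods, the '+' separator between tag and form, and the tag "x" on the pies entry are all made up for the example. A real dictionary stores its entries in a compact encoded form, so the collected sequences would still need decoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixCollect {
    /** Hypothetical trie node; a stand-in for a state of the dictionary automaton. */
    public static final class Node {
        final Map<Character, Node> next = new TreeMap<>();
        boolean terminal;
    }

    /** Adds one entry to the trie, character by character. */
    public static void insert(Node root, String entry) {
        Node n = root;
        for (char c : entry.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
    }

    /** Follows the prefix to its node, then collects every suffix reachable from there. */
    public static List<String> collect(Node root, String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.next.get(c);
            if (n == null) {
                return new ArrayList<>(); // prefix is not in the trie
            }
        }
        List<String> out = new ArrayList<>();
        walk(n, new StringBuilder(), out);
        return out;
    }

    /** Depth-first walk that records the path at every terminal node. */
    private static void walk(Node n, StringBuilder path, List<String> out) {
        if (n.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : n.next.entrySet()) {
            path.append(e.getKey());
            walk(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        // Entries are "base|tag+form"; the '+' separator and the tag "x" are made up.
        insert(root, "niemiecki|adjp+niemiecku");
        insert(root, "niemiecki|adja+niemiecko");
        insert(root, "pies|x+psy");
        // prints [adja+niemiecko, adjp+niemiecku]
        System.out.println(collect(root, "niemiecki|"));
    }
}
```

A single traversal from the prefix node returns every tag and form pair for the base word, which avoids issuing one lookup per tag.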