Add option to skip abbreviations

apohllo commented 7 years ago

The Morfologik dictionary is used in both Solr and Elasticsearch. The problem is that the compilation available in the Maven repositories contains many abbreviations that are problematic, to say the least, for text retrieval. The most striking example is the letter "w", which is expanded to "wiek". As a side effect, when the user enters "wiek" as the search query, all documents that include "w" are retrieved.
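
The effect is easy to reproduce with a toy index. The sketch below is not Morfologik's API: the `LEMMAS` table and the `analyze`, `build_index` and `search` helpers are hypothetical stand-ins, and it assumes every lemma returned for a token gets indexed. Only the "w" to "wiek" mapping comes from the report above; the other entries and the sample documents are made up.

```python
from collections import defaultdict

# Hypothetical stand-in for a dictionary lookup: surface form -> lemmas.
# Only "w" -> "wiek" is taken from the report; the rest is illustrative.
LEMMAS = {
    "w": ["w", "wiek"],
    "wieku": ["wiek"],
    "wiek": ["wiek"],
}

def analyze(text):
    """Lowercase, split on whitespace, and expand each token to all of its lemmas."""
    terms = []
    for token in text.lower().split():
        terms.extend(LEMMAS.get(token, [token]))
    return terms

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents matching any analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)

docs = {
    1: "W domu",        # contains only the single letter "w"
    2: "XX wiek",       # actually mentions "wiek"
    3: "Kot na dachu",  # unrelated
}
index = build_index(docs)
print(search(index, "wiek"))  # [1, 2]
```

Document 1 is returned for "wiek" only because the abbreviation reading of "w" was indexed alongside the literal token.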

I would suggest providing an abbreviation-less compilation of the dictionary, or at least an option to compile the dictionary without abbreviations.

dweiss commented 7 years ago

I think you shouldn't expand a single-letter token... Anyway, Morfologik returns multiple tokens and their tags, not just one, and it needs to include all potential word forms (and their tags) for applications other than information retrieval. If you wish to disambiguate at the dictionary level (it won't always be possible or sensible), you can recompile your own dictionary and tell Solr (or ES) which resource to load (Solr definitely has an option to override the default dictionary; I don't know about ES).
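
One way to prepare such a recompilation is to filter the dictionary source before building it. This is a minimal sketch, assuming the source is a text file with one `form;lemma;tags` entry per line and that abbreviation readings carry a `brev` tag. Both the column layout and the tag name are assumptions to check against your own copy of the dictionary. `drop_abbreviations` is a hypothetical helper, not part of Morfologik, and the sample entries are invented.

```python
def drop_abbreviations(lines, separator=";", abbreviation_tag="brev"):
    """Yield dictionary lines with abbreviation readings removed.

    Assumes the tags are the last field, alternative readings are joined
    with "+", and the parts of each reading are joined with ":".
    Entries left with no reading at all are dropped entirely.
    """
    for line in lines:
        entry = line.rstrip("\n")
        if not entry:
            continue  # skip blank lines
        fields = entry.split(separator)
        readings = fields[-1].split("+")
        kept = [r for r in readings if abbreviation_tag not in r.split(":")]
        if kept:
            fields[-1] = "+".join(kept)
            yield separator.join(fields)

# Invented sample entries in the assumed format.
sample = [
    "w;wiek;brev:pun",
    "w;w;prep:acc:nwok+prep:loc:nwok",
    "wieku;wiek;subst:sg:gen:m3+subst:sg:loc:m3",
]
filtered = list(drop_abbreviations(sample))
print("\n".join(filtered))
```

The filtered file would then be compiled and supplied as the overriding dictionary resource; those steps depend on your tooling and are outside this sketch.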

apohllo commented 7 years ago

Well, this is not actually a problem for me, since I am aware of the pipeline used by Solr and ES, so I have already patched my own copy of the dictionary. But I think many people are totally unaware of how Morfologik works under the hood and are very surprised by such results. They just add a dependency in Maven and don't want to touch any internals. Providing a dictionary tailored for text retrieval (which I believe is the most popular use case for Morfologik - at least for people not doing any NLP research) would be convenient.

So, if this is not a problem for the authors, I hope you won't be angry if such tailored versions are uploaded to some public Maven repository (with an appropriate library name change)?

dweiss commented 7 years ago

The license for Morfologik is BSD, so you can do this. But I honestly don't think your "tailored" version would be any better than the default one. What seems better for you wouldn't be a good fit for others. There will always be some surprising results, skewed one way or another. Perhaps the best way to fix the problem would be to provide a patch to Solr/ES that explains how Morfologik "stemming" works and what needs to be done to avoid unexpected results (this also depends on how you configure the analyzer chain, for example).

Or, even better, provide a patch on top of Morfologik that adds some confidence value for each token; then they can be sorted and people can index just the common meaning(s) of a given term. This will still yield some problems, but would be a nice improvement to get rid of the examples you mentioned (since they seem pretty rare).
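
To make the idea concrete, here is a minimal sketch of the selection step. The confidence values are invented, since the comment above only proposes them as a patch, and `Reading` and `select_readings` are hypothetical names, not Morfologik's API. The sketch sorts a token's readings by confidence and keeps only the top ones above a threshold, which is what an indexer could do to store just the common meanings.

```python
from typing import NamedTuple

class Reading(NamedTuple):
    lemma: str
    tag: str
    confidence: float  # hypothetical score in [0, 1]; not provided by Morfologik

def select_readings(readings, min_confidence=0.5, max_readings=1):
    """Return up to max_readings readings at or above the threshold, most confident first."""
    ranked = sorted(readings, key=lambda r: r.confidence, reverse=True)
    return [r for r in ranked if r.confidence >= min_confidence][:max_readings]

# Invented scores: the abbreviation reading of "w" is assumed to be rare.
readings_for_w = [
    Reading("wiek", "brev", 0.05),
    Reading("w", "prep", 0.95),
]
print([r.lemma for r in select_readings(readings_for_w)])  # ['w']
```

With these made-up scores only the preposition reading would be indexed, so a query for "wiek" would no longer match documents that merely contain "w".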

Note that Morfologik wasn't created for information retrieval. Sure, we're glad it's helpful there, but the objective of the library is to provide as complete a morphosyntactic description as possible of a wide range of Polish tokens.
