jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0
87 stars, 38 forks

Controlling decomposition #24

Closed by paweloque 8 years ago

paweloque commented 8 years ago

I'd like to be able to use a dictionary-based approach to control which words will not be decomposed, similar to: https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html

The words in a dictionary will not be decomposed by the plugin and will only produce the original token as output.

Example: I'm indexing product data and merchant information. Some of the words are merchant names, such as Interdiscount. I want to be able to control the decomposition plugin by providing a dictionary of words that must not be decomposed.

jprante commented 8 years ago

Thanks for the example. I think the real solution is to implement entity name recognition. As a start, a simple small list in the config would do. But for the general case, a real solution would be to phrase detected names, maybe with the help of Wikidata dumps, and mark them as entities, which cannot be decomposed at all.

paweloque commented 8 years ago

Which words have to be split really depends on the domain. I therefore think that making this configurable by the user is better than automatic entity detection. Also, when new words come up, you might not find them in common dictionaries.

paweloque commented 8 years ago

I've only just discovered the "respect_keywords": true parameter. Combined with the keyword_marker filter, it lets me exclude keywords from decomposition, so my requirement is already covered.
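
For anyone landing here with the same question, here is a minimal sketch of what such index settings might look like, built in Python and printed as JSON. The names protected_names, decomp and decomp_analyzer are made up for illustration, and the filter type "decompound" is assumed from the plugin's README. The keyword_marker filter and its "keywords" option are standard Elasticsearch. The keyword_marker filter is placed before the decompound filter so that tokens are already flagged as keywords when the decompounder sees them.

```python
import json


def build_index_settings(protected_words):
    """Build index settings that keep protected_words from being decomposed.

    Filter and analyzer names are hypothetical; the "decompound" type and
    the "respect_keywords" option are assumed from the plugin's README.
    """
    return {
        "index": {
            "analysis": {
                "filter": {
                    # Standard Elasticsearch filter: flags the listed words
                    # as keywords (matching is case-sensitive by default).
                    "protected_names": {
                        "type": "keyword_marker",
                        "keywords": list(protected_words),
                    },
                    # Decompound filter told to leave keyword tokens alone.
                    "decomp": {
                        "type": "decompound",
                        "respect_keywords": True,
                    },
                },
                "analyzer": {
                    "decomp_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Order matters: mark keywords first, then decompound.
                        "filter": ["protected_names", "decomp"],
                    }
                },
            }
        }
    }


settings = build_index_settings(["Interdiscount"])
print(json.dumps(settings, indent=2))
```

The printed JSON could be used as the settings body when creating an index. For a longer word list, keyword_marker also accepts a keywords_path pointing to a file instead of an inline list.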