paweloque closed this issue 8 years ago
Thanks for the example. I think the real solution is to implement named entity recognition. For a start, a small list in the config would do. For the general case, though, a real solution would be to detect names, maybe with the help of Wikidata dumps, and mark them as entities that cannot be decomposed at all.
Which words have to be split really depends on the domain. I therefore think that making this configurable by the user is better than automatic entity detection. Also, when new words come up, you might not find them in common dictionaries.
I have only now discovered the "respect_keywords": true parameter. Together with the keyword_marker filter, I can exclude keywords from decomposition. So my requirement is already covered.
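For anyone finding this later, here is a rough sketch of how the two pieces might be combined in the index settings. It is written as a Python dict so it can be dumped to JSON for the create-index request. The filter and analyzer names (protect_names, decomp, product_text) are made up for illustration, and "decompound" as the plugin's filter type is an assumption; only keyword_marker, its keywords list, and "respect_keywords": true come from the discussion above.

```python
import json

# Sketch of index settings, not a verified configuration.
# "protect_names", "decomp" and "product_text" are hypothetical names;
# the filter type "decompound" is assumed to be the plugin's type name.
settings = {
    "analysis": {
        "filter": {
            # Marks the listed tokens as keywords (case-sensitive by default).
            "protect_names": {
                "type": "keyword_marker",
                "keywords": ["Interdiscount"],
            },
            # Decompounder told to leave keyword-marked tokens alone.
            "decomp": {
                "type": "decompound",
                "respect_keywords": True,
            },
        },
        "analyzer": {
            "product_text": {
                "type": "custom",
                "tokenizer": "standard",
                # Order matters: mark keywords before decomposing.
                "filter": ["protect_names", "decomp"],
            }
        },
    }
}

# Body for the create-index request.
print(json.dumps({"settings": settings}, indent=2))
```

The keyword_marker filter has to run before the decompounder, otherwise the tokens are already split by the time they would be marked.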
I'd like to be able to use a dictionary-based approach to control which words will not be decomposed. Something similar to: https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-stemming.html
Words in the dictionary will not be decomposed by the plugin and will produce only the original token as output.
Example: I'm indexing product data and merchant information. Some of the words are merchant names like: Interdiscount. I want to be able to control the decomposition plugin by providing a dictionary with words that must not be decomposed.
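To make the requested behaviour concrete, here is a toy sketch in Python. It is not the plugin's code: toy_split is a hypothetical stand-in for the real decompounder, and emitting the original token followed by its parts is an assumption about the output shape.

```python
from typing import Callable, Iterable, List, Set


def decompose_with_exclusions(
    tokens: Iterable[str],
    protected: Set[str],
    split: Callable[[str], List[str]],
) -> List[str]:
    """Emit each token, plus its parts unless the token is protected."""
    out: List[str] = []
    for token in tokens:
        out.append(token)  # the original token is always kept
        if token.lower() not in protected:
            out.extend(split(token))  # only unprotected tokens are split
    return out


def toy_split(token: str) -> List[str]:
    """Hypothetical splitter backed by a tiny lookup table."""
    table = {
        "Interdiscount": ["Inter", "Discount"],
        "Kinderwagen": ["Kinder", "Wagen"],
    }
    return table.get(token, [])


result = decompose_with_exclusions(
    ["Interdiscount", "Kinderwagen"],
    protected={"interdiscount"},
    split=toy_split,
)
print(result)  # ['Interdiscount', 'Kinderwagen', 'Kinder', 'Wagen']
```

Without the protected set, the merchant name would also come out as "Inter" and "Discount", which is exactly what the dictionary is meant to prevent.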