opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Built-in support for German words decompounder #1279

Open quangmaubach opened 2 years ago

quangmaubach commented 2 years ago

What kind of business use case are you trying to solve? What are your requirements?

We would like an out-of-the-box German decompounder: an analyzer shipped with OpenSearch that we can simply use, similar to the level of support that Japanese has through the kuromoji analysis plugin.

What is the problem? What is preventing you from meeting the requirements?

Many German words are compounds of shorter words, for example Fallschirmspringerschule (skydiving school) = Fallschirm (parachute) + Springer (jumper) + Schule (school). To get good-quality search results, we need a decompounder that breaks such words into their components, so that each component can be searched individually.
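To make the idea concrete, here is a minimal Python sketch of dictionary-based decompounding. It is a toy illustration only, not the Lucene implementation: the function name `decompound`, its parameters, and the three-word dictionary are all assumptions made for this example.

```python
def decompound(token, dictionary, min_subword_size=2, max_subword_size=15):
    """Return the lowercased token followed by every dictionary word found inside it.

    Toy illustration: scan every start position and every allowed length,
    and keep each substring that appears in the dictionary.
    """
    lowered = token.lower()
    subwords = []
    for start in range(len(lowered) - min_subword_size + 1):
        for length in range(min_subword_size, max_subword_size + 1):
            end = start + length
            if end > len(lowered):
                break  # substring would run past the end of the token
            candidate = lowered[start:end]
            if candidate in dictionary:
                subwords.append(candidate)
    # Keep the original token so exact matches on the compound still work.
    return [lowered] + subwords


# Hypothetical dictionary, just large enough for the example word.
toy_dictionary = {"fallschirm", "springer", "schule"}
tokens = decompound("Fallschirmspringerschule", toy_dictionary)
print(tokens)
```

The quality of the result depends entirely on the dictionary: a missing entry means a missed component, and an overly broad dictionary produces spurious subwords.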

Lucene and OpenSearch already offer some support for decompounding Germanic languages through the hyphenation and dictionary decompounder token filters (see, for example, the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-hyp-decomp-tokenfilter.html and https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-dict-decomp-tokenfilter.html). However, we still need to find a good dictionary file to use these token filters effectively.
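For reference, a rough sketch of what using the dictionary decompounder filter looks like today, written as a Python dict that serializes to an index-creation request body. The filter and analyzer names (`german_decompounder`, `german_decompound`) and the inline word list are placeholders chosen for this example; a real setup would need a far larger dictionary, typically loaded from a file through `word_list_path`.

```python
import json


def german_decompound_settings(word_list):
    """Build index settings with a dictionary_decompounder token filter.

    The filter and analyzer names are arbitrary placeholders.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "german_decompounder": {
                        "type": "dictionary_decompounder",
                        "word_list": list(word_list),
                        "min_subword_size": 4,
                        "only_longest_match": True,
                    }
                },
                "analyzer": {
                    "german_decompound": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so tokens match the lowercase word list.
                        "filter": ["lowercase", "german_decompounder"],
                    }
                },
            }
        }
    }


body = german_decompound_settings(["fallschirm", "springer", "schule"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of an index-creation request. Everything that determines result quality sits in the word list, which is exactly the part users currently have to source themselves.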

There are some efforts to build such a dictionary, for example https://github.com/uschindler/german-decompounder

Another approach to the same problem is to ship a prebuilt trie inside a plugin, for example https://github.com/jprante/elasticsearch-analysis-decompound

The problem attracts a good deal of interest, judging by web search results. For example, this SO post mentions both of the approaches above: https://stackoverflow.com/questions/59595689/elasticsearch-handling-german-compound-words

I think better language support for German would benefit OpenSearch users. They would not need to reinvent the wheel, or to build or hunt for a dictionary of questionable accuracy.

What are you proposing? What do you suggest we do to solve the problem or improve the existing situation?

I do not have any specific solution in mind.

What are your assumptions or prerequisites?

I assume that there is demand for such a German decompounder token filter, and that users are either not satisfied with the existing solutions or not able to install plugins on OpenSearch.

What are remaining open questions?

dblock commented 2 years ago

Thanks for writing this up! Moving this into OpenSearch, please open new issues there for the core engine.

quangmaubach commented 2 years ago

Hi team, is there any guidance/suggestion for this issue? Should I start working on some PR for this?

dblock commented 2 years ago

> Hi team, is there any guidance/suggestion for this issue? Should I start working on some PR for this?

No need to ask for permission! Comment here if you're working on it so we don't duplicate work.

anasalkouz commented 2 years ago

Hi @quangmaubach, are you still working on this?

avidanov commented 9 months ago

I've addressed a similar requirement in my recent blog: "Enhancing German Search in Amazon OpenSearch Service." It dives into effectively handling German compound words using dictionary_decompounder and synonym filters, enhancing search accuracy significantly.

Key Points:

Check out the blog for a detailed walkthrough: [Enhancing German Search in Amazon OpenSearch Service]

While the blog doesn't offer a plug-and-play solution, it provides insights and a potential pathway for enhancing German language support in OpenSearch, considering the complexities of German compound words.