jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0

Matching tokens #11

Open marbleman opened 9 years ago

marbleman commented 9 years ago

Hi,

I am stuck on this issue and I am quite sure I am missing something essential:

I set up the analyzer as shown below and it works quite well:

GET /myIndex/_analyze?analyzer=german&text=Straßenbahnschienenritzenreiniger

gives me all kinds of tokens. But searching returns every document containing just ONE of the tokens (as if they were combined with an OR operator), ranking documents containing "straße" higher than documents containing "reiniger" and ignoring multiple matches in the score. This is of course not what I intended...

However, I can see, that an AND-Operator for tokens would not do the right thing either... In fact the operation that could work would be something like (tokens derived from "straße" combined with OR) AND (tokens derived from "bahn" combined with OR) AND (...)

I could run _analyze from the external application and build the AND/OR query there, but that does not seem very elegant.
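For illustration, the externally built query described above could look roughly like the following sketch. It assumes the tokens have already been grouped by the subword they were derived from (that grouping is the hard part and is not shown); the helper name grouped_query and the token groups are hypothetical, and only the standard bool/term query DSL is used.

```python
import json


def grouped_query(field, token_groups):
    """Build (OR within each group) AND (across groups) as a bool query.

    token_groups is a list of lists; each inner list holds the tokens
    derived from one subword. The grouping itself is assumed to be given.
    """
    must = []
    for tokens in token_groups:
        should = [{"term": {field: token}} for token in tokens]
        # At least one token of every group has to match.
        must.append({"bool": {"should": should, "minimum_should_match": 1}})
    return {"query": {"bool": {"must": must}}}


# Hypothetical token groups, written by hand for this example only.
groups = [["straße", "strasse"], ["bahn"], ["schiene"]]
grouped = grouped_query("title", groups)
print(json.dumps(grouped, ensure_ascii=False, indent=2))
```

The resulting body would be sent to the _search endpoint instead of a plain match query, which is exactly the extra application-side step mentioned above.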

Is there another/better way?

"analysis": {
    "filter": {
       "baseform": {
          "type": "baseform",
          "language": "de"
       },
       "decomp": {
          "type": "decompound"
       }
    },
    "analyzer": {
       "german": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "custom",
          "tokenizer": "baseform"
       }
    },
    "tokenizer": {
       "baseform": {
          "filter": [
             "decomp",
             "baseform"
          ],
          "type": "standard"
       }
    }
 }

jprante commented 9 years ago

So far I have only tried the decompounder as an index analyzer. But I will have a look into the issue. It seems related to the problem that occurs when searching for synonyms with the synonym filter.

marbleman commented 9 years ago

I guess any filter that adds words has to deal with this in some way: as long as you search for just one word, adding synonyms with OR will be OK. But when searching for two words... I'll set up a synonym filter in the next few days to cross-check.

marbleman commented 9 years ago

It took quite a while, but I promised to come back with some details, and here is what I found:

I used the explain API on a field with a baseform filter applied, which adds a base form for verbs, and processed the phrase "hoch gezogen":

"query": { "multi_match": { "query": "hoch gezogen", "fields": ["title"], "operator": "and" } }

Result: "explanation": "+title:hoch +(title:gezog title:zieh)"

As expected, the query searches for "hoch" AND ("gezog" OR "zieh"), which is exactly what we want. The synonym filter will do the same thing.

However, when I use the decompounder and explain a search for the phrase "Abfall Kunststoff", the result is

"explanation": "+title:abfall +(title:kunststoff title:kunst title:stoff)"

As a result, we will find any document talking about "Abfall" and "Stoff", or any kind of "Kunst Abfall"... OK, one can find a lot of rubbish declared to be art... ;) but that wasn't what our search was about.

The correct search should look like this: +title:abfall +(title:kunststoff | (+title:kunst +title:stoff)). Forgive me if this is not syntactically correct; we want "kunststoff" OR ("kunst" AND "stoff").
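Expressed in the Elasticsearch query DSL, the intended logic could be sketched as below. This is only an illustration of the desired boolean structure, built by hand; it is not something the analyzer produces, and the helper name compound_or_parts is made up for this example.

```python
import json


def compound_or_parts(field, compound, parts):
    """Match the compound itself OR all of its parts together."""
    return {
        "bool": {
            "should": [
                {"term": {field: compound}},
                {"bool": {"must": [{"term": {field: p}} for p in parts]}},
            ],
            "minimum_should_match": 1,
        }
    }


# "abfall" AND ("kunststoff" OR ("kunst" AND "stoff"))
wanted = {
    "query": {
        "bool": {
            "must": [
                {"term": {"title": "abfall"}},
                compound_or_parts("title", "kunststoff", ["kunst", "stoff"]),
            ]
        }
    }
}
print(json.dumps(wanted, indent=2))
```

With this structure a document containing only "abfall" and "stoff" no longer matches, because neither branch of the inner should clause is satisfied.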

Ok, I admit the example is not too good... in fact "Kunststoff" should not be decompounded at all. But this is another issue...

So when you say you have never used the decompounder on the query side: I cannot see how to get proper results if the decompounder is applied to the index only. In my understanding, the point of decompounding "Hochfrequenzumkehrschraube" is to find documents talking about "Schrauben für die Umkehrung von Hochfrequenz". And this is where I am stuck...

fgrosse commented 8 years ago

I am running into exactly the same issue.

Let's say I index two documents where the text field is decompounded:

{ "_id" : 1, "text" : "...direkt im Stadtzentrum..." }
{ "_id" : 2, "text" : "... Forschungszentrum..." }

Stadtzentrum from document 1 is decompounded into stadt and zentrum. Forschungszentrum from document 2 is decompounded into forschung and zentrum.

Then I run the following search:

{
    "query": {
        "multi_match": {
           "query": "Forschungszentrum",
           "operator": "and",
           "fields": [ "title", "text"]
        }
    }
}

Unfortunately this returns both documents even though I used the and operator. I don't want to find everything that contains the term zentrum.

If the query were "Forschung zentrum", it would work as expected, but this is user input and cannot be controlled.

Did you ever find a solution to this @marbleman ? @jprante If you want I can open a new issue at jprante/elasticsearch-plugin-bundle

fgrosse commented 8 years ago

P.S. We cannot use the decompounder for indexing only. Consider the following use case:

{ "_id" : 1, "text" : "Krebsforschungszentrum" }

Search:

{
    "query": {
        "match": { "text": "Forschungszentrum" }
    }
}

In that case the search term needs to be decompounded so that we can find the Krebsforschungszentrum.

AndreKR commented 8 years ago

It's impossible for a TokenFilter to have an interpretation like "+title:abfall +(title:kunststoff | (+title:kunst +title:stoff))" because of the way QueryBuilder.analyzeMultiBoolean() works. What we can have is an interpretation like "+title:abfall +title:kunst +title:stoff".

To get it, pull https://github.com/jprante/elasticsearch-analysis-decompound/pull/19 and set only_subwords: true.
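The difference between the two interpretations can be illustrated with a toy model. This is not the plugin and not Lucene: the decompositions are hard-coded assumptions, and the two matching rules simply mirror the interpretations quoted in this thread (compound plus subwords OR-ed at one position, versus subwords only, all required).

```python
# Hard-coded, hypothetical decompositions for the words used in this thread.
SUBWORDS = {
    "stadtzentrum": ["stadt", "zentrum"],
    "forschungszentrum": ["forschung", "zentrum"],
    "krebsforschungszentrum": ["krebs", "forschung", "zentrum"],
}


def analyze(word, subwords_only):
    """Return the token set for one word under the toy decompounder."""
    word = word.lower()
    parts = SUBWORDS.get(word, [word])
    if subwords_only or parts == [word]:
        return set(parts)
    return {word, *parts}  # compound kept alongside its subwords


def search(docs, query_word, subwords_only):
    """Return ids of matching docs under the chosen interpretation."""
    q = analyze(query_word, subwords_only)
    hits = []
    for doc_id, word in docs.items():
        d = analyze(word, subwords_only)
        if subwords_only:
            matched = q <= d        # every query subword is required
        else:
            matched = bool(q & d)   # any one of the tokens is enough
        if matched:
            hits.append(doc_id)
    return hits


docs = {1: "Stadtzentrum", 2: "Forschungszentrum", 3: "Krebsforschungszentrum"}
print(search(docs, "Forschungszentrum", subwords_only=False))  # [1, 2, 3]
print(search(docs, "Forschungszentrum", subwords_only=True))   # [2, 3]
```

Under the subwords-only rule the "Stadtzentrum" document drops out while "Krebsforschungszentrum" is still found, which covers both use cases raised earlier in the thread.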

jprante commented 8 years ago

@AndreKR thanks for fixing only_subwords.

Good analysis of QueryBuilder.analyzeMultiBoolean: only one boolean operator can be used for the whole clause list. I think that for improved token stream analysis on subwords, the whole query must be rewritten with transformed boolean operators so that groups of AND and OR can be handled. At the moment this has to be done before token stream analysis within Lucene, because Lucene does not offer a good API for query transformations.

AndreKR commented 8 years ago

Honestly, I would even remove the only_subwords option and make it default to true. What is the use of getting the compound word along with its subwords? If we just get the subwords, the analyzed token stream can be freely used in whatever combination of queries.

jprante commented 8 years ago

@AndreKR you are right, with https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html it is possible to keep the compound word anyway. I will change the behavior in a new version.
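A possible configuration along these lines is sketched below as a Python dict that can be dumped to JSON for the index settings. It is an untested assumption, not a confirmed recipe: it relies on the decompound filter skipping tokens flagged as keywords when respect_keywords is enabled and emitting only subwords when subwords_only is enabled, both option names being taken from this thread. The filter and analyzer names decomp and german_keep_compound are arbitrary.

```python
import json

# Assumed behaviour: keyword_repeat emits each token twice, once flagged as
# a keyword; the decompounder leaves the flagged copy untouched and replaces
# the other copy with its subwords.
analysis = {
    "analysis": {
        "filter": {
            "decomp": {
                "type": "decompound",
                "subwords_only": True,     # option name as given in this thread
                "respect_keywords": True,  # option name as given in this thread
            }
        },
        "analyzer": {
            "german_keep_compound": {
                "type": "custom",
                "tokenizer": "standard",
                # keyword_repeat has to come before the decompounder.
                "filter": ["lowercase", "keyword_repeat", "decomp"],
            }
        },
    }
}
print(json.dumps(analysis, indent=2))
```

Words that are not compounds would appear twice in such a token stream, so a de-duplicating filter at the end of the chain may be desirable.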

marbleman commented 8 years ago

I am glad to see that this wasn't just a lack of understanding on my side ;-) And no: I did not find a workaround for it yet except building the query in another step. Since I did not find the time to walk through the code myself yet, I really appreciate a solution to this issue!

However, after getting around this one, there might be another related one: There are lots of compound words such as "Straßenbahn" or "Kugelbolzen" for example that must not be decompounded at all...

Let me know if you are interested in some exchange of experience

AndreKR commented 8 years ago

What's the harm in having Straßenbahn decompounded during indexing and searching? Anyway, there is a (currently undocumented) option respect_keywords that you can set to true and then you can block words from being decompounded in the same way as with other filters.

fgrosse commented 8 years ago

See #14 for respect_keywords pull request.

I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

fgrosse commented 8 years ago

@jprante will you merge that change into https://github.com/jprante/elasticsearch-plugin-bundle/ as well and release a new version? I switched to elasticsearch-plugin-bundle as you recommended earlier. If not I will switch back to this repository.

jprante commented 8 years ago

Merged into bundle plugin release 2.1.0.1

AndreKR commented 8 years ago

I would be interested in some exchange. How can I reach you? I don't want to spam the issue here too much :)

@fgrosse Who were you talking to? Anyway, my profile now has an email address.

fgrosse commented 8 years ago

Since the name of the configuration option has been mixed up here twice as only_subwords, I want to point out that the correct option is called subwords_only.