jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins
GNU Affero General Public License v3.0

decompound filter returns non-compound words twice #9

Open · ackermann opened this issue 8 years ago

ackermann commented 8 years ago

First of all: thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: the decompound filter correctly returns the subwords of compound words, but it returns every non-compound word twice (i.e. it emits the word a second time, as if it were its own single subword).

This is the simplified version of my index settings to reproduce the problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder]
            filter:
                decompounder:
                    type: decompound
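
For completeness, these settings can be applied when creating a test index, roughly like this (a sketch using the JSON request format of newer Elasticsearch versions; test is a placeholder index name):

curl -XPUT 'localhost:9200/test' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "german_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decompounder"]
          }
        },
        "filter": {
          "decompounder": {
            "type": "decompound"
          }
        }
      }
    }
  }
}'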

Querying /_analyze with the text Grundbuchamt Anwältin returns:

tokens:
- token: "Grundbuchamt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Grund"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "buch"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "amt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1

As you can see, the token Anwältin is returned twice with the same offset and position.
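
For anyone reproducing this, the request looks roughly like the following (again a sketch; test is the placeholder index from above):

curl -XGET 'localhost:9200/test/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "german_analyzer",
  "text": "Grundbuchamt Anwältin"
}'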

(By the way: setting subwords_only to true eliminates the duplicates.)
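
For example, that variant only needs the extra flag on the filter definition (filter section only; the rest of the settings stay as above):

filter:
    decompounder:
        type: decompound
        subwords_only: true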

Do you have an idea how we might fix this behaviour?

jprante commented 8 years ago

There may be a flaw. As a workaround, duplicates can be removed from the token stream with the standard "unique" token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html

ackermann commented 8 years ago

Thanks! I just came back to post this as well. One important note: the unique filter should be used with only_on_same_position: true, because otherwise it removes duplicates across the entire token stream, so legitimately repeated words in a document are dropped as well and the term frequencies get heavily distorted.

As an example for others with the same problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder, unique_decomp]
            filter:
                unique_decomp:
                    type: unique
                    only_on_same_position: true
                decompounder:
                    type: decompound
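
With this chain, the _analyze example from above should return each term only once per position: the duplicate Anwältin at position 1 is removed, while Grund, buch, and amt survive, because they are distinct terms even though they share position 0 with Grundbuchamt. A sketch of the expected token stream (abbreviated, not verified against the plugin):

tokens:
- token: "Grundbuchamt"   # position 0
- token: "Grund"          # position 0
- token: "buch"           # position 0
- token: "amt"            # position 0
- token: "Anwältin"       # position 1, duplicate removed by unique_decomp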