Open ackermann opened 8 years ago
There may be a flaw. As a workaround, duplicates can be removed from the token stream with the standard "unique" filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html
Thanks! I just came back to post this as well. What's important to note is that the unique filter should be used with only_on_same_position: true, because otherwise the term frequency will be heavily distorted.
As an example for others with the same problem:
settings:
  index:
    analysis:
      analyzer:
        german_analyzer:
          type: custom
          tokenizer: standard
          filter: [decompounder, unique_decomp]
      filter:
        unique_decomp:
          type: unique
          only_on_same_position: true
        decompounder:
          type: decompound
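To illustrate why only_on_same_position matters for term frequency, here is a minimal Python sketch of the deduplication idea. It is a simplified model, not the Lucene or Elasticsearch implementation: the `dedupe` function and the (term, position) token tuples are hypothetical, and the token list assumes a text in which "Anwältin" occurs twice and each occurrence is emitted twice by the decompounder.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position); a simplified stand-in for a real token


def dedupe(tokens: List[Token], only_on_same_position: bool) -> List[Token]:
    """Drop repeated tokens, keeping the first occurrence of each key.

    With only_on_same_position=True, a token is a duplicate only if the same
    term was already seen at the same position. With False, any repeat of the
    term anywhere in the stream is dropped.
    """
    seen = set()
    kept = []
    for term, position in tokens:
        key = (term, position) if only_on_same_position else term
        if key in seen:
            continue
        seen.add(key)
        kept.append((term, position))
    return kept


# Hypothetical stream: the word occurs at positions 0 and 1, doubled each time.
doubled = [("Anwältin", 0), ("Anwältin", 0), ("Anwältin", 1), ("Anwältin", 1)]

same_position_only = dedupe(doubled, only_on_same_position=True)
everywhere = dedupe(doubled, only_on_same_position=False)

print(same_position_only)  # one token per position, so both occurrences survive
print(everywhere)          # a single token remains for the whole text
```

In this model, deduplicating only on the same position keeps one token per real occurrence, while deduplicating across the whole stream collapses both occurrences into one, which is the term frequency distortion mentioned above.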
First of all: Thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: The decompound filter correctly returns the subwords of compound words but returns every word that's not a compound word twice (i.e. it treats the compound word as a single subword of itself).
This is the simplified version of my index settings to reproduce the problem:
Querying /_analyze with the text Grundbuchamt Anwältin returns:
As you can see, the token Anwältin is returned twice with the same offset and position. (Setting subwords_only to true eliminates the duplicates, by the way.)
Do you have an idea how we might fix this behaviour?
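For anyone who wants to check their own analyzer for this symptom, the following Python sketch scans an /_analyze response for tokens that repeat with the same position and offsets. The `find_duplicate_tokens` helper is hypothetical, and the sample response is hand-written for illustration (it is not the output from this issue and omits any subword tokens); only the general response shape with `token`, `start_offset`, `end_offset` and `position` fields is taken from the _analyze API.

```python
from collections import Counter


def find_duplicate_tokens(response):
    """Return (token, position, start_offset, end_offset) keys seen more than once."""
    counts = Counter(
        (t["token"], t["position"], t["start_offset"], t["end_offset"])
        for t in response["tokens"]
    )
    return [key for key, n in counts.items() if n > 1]


# Hand-written sample in the shape of an /_analyze response (illustrative only).
sample_response = {
    "tokens": [
        {"token": "Grundbuchamt", "start_offset": 0, "end_offset": 12,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "Anwältin", "start_offset": 13, "end_offset": 21,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "Anwältin", "start_offset": 13, "end_offset": 21,
         "type": "<ALPHANUM>", "position": 1},
    ]
}

duplicates = find_duplicate_tokens(sample_response)
print(duplicates)
```

Running the same check on the response from an analyzer that includes the unique filter should yield an empty list if the workaround is effective.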