image:https://api.travis-ci.org/jprante/elasticsearch-plugin-bundle.svg[title="Build status", link="https://travis-ci.org/jprante/elasticsearch-plugin-bundle/"] image:https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib.elasticsearch.plugin%3Aelasticsearch-plugin-bundle/coverage.svg?style=flat-square[title="Coverage", link="https://sonarqube.com/dashboard/index?id=org.xbib.elasticsearch.plugin%3Aelasticsearch-plugin-bundle"] image:https://maven-badges.herokuapp.com/maven-central/org.xbib.elasticsearch.plugin/elasticsearch-plugin-bundle/badge.svg[title="Maven Central", link="http://search.maven.org/#search%7Cga%7C1%7Cxbib%20elasticsearch-plugin-bundle"] image:https://img.shields.io/badge/License-Apache%202.0-blue.svg[title="Apache License 2.0", link="https://opensource.org/licenses/Apache-2.0"] image:https://img.shields.io/twitter/url/https/twitter.com/xbib.svg?style=social&label=Follow%20%40xbib[title="Twitter", link="https://twitter.com/xbib"] image:https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif[title="PayPal", link="https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=GVHFQYZ9WZ8HG"]
IMPORTANT: Because this Elasticsearch plugin is licensed under AGPL 3.0 which is not compatible to SSPL or Elastic License 2.0, the last version supported is Elasticsearch 7.10.2. Sorry for inconvenience and thank you for your understanding.
This plugin is the combination of the following plugins:
The plugin code in each plugin is equivalent to the code in this combined bundle plugin.
.Compatibility matrix [frame="all"] |=== | Plugin version | Elasticsearch version | Release date | 6.3.2.2 | 6.3.2 | Oct 2, 2018 | 5.4.1.0 | 5.4.0 | Jun 1, 2017 | 5.4.0.1 | 5.4.0 | May 12, 2017 | 5.4.0.0 | 5.4.0 | May 4, 2017 | 5.3.1.0 | 5.3.1 | Apr 25, 2017 | 5.3.0.0 | 5.3.0 | Apr 4, 2017 | 5.2.2.0 | 5.2.2 | Mar 2, 2017 | 5.2.1.0 | 5.2.1 | Feb 27, 2017 | 5.1.1.2 | 5.1.1 | Feb 27, 2017 | 5.1.1.0 | 5.1.1 | Dec 31, 2016 | 2.3.4.0 | 2.3.4 | Jul 30, 2016 | 2.3.3.0 | 2.3.3 | May 23, 2016 | 2.3.2.0 | 2.3.2 | May 11, 2016 | 2.2.0.6 | 2.2.0 | Mar 25, 2016 | 2.2.0.3 | 2.2.0 | Mar 6, 2016 | 2.2.0.2 | 2.2.0 | Mar 3, 2016 | 2.2.0.1 | 2.2.0 | Feb 22, 2016 | 2.2.0.0 | 2.2.0 | Feb 8, 2016 | 2.1.1.2 | 2.1.1 | Dec 30, 2015 | 2.1.1.0 | 2.1.1 | Dec 21, 2015 | 2.1.0.0 | 2.1.0 | Nov 27, 2015 | 2.0.0.0 | 2.0.0 | Oct 28, 2015 | 1.6.0.0 | 1.6.0 | Jun 30, 2015 | 1.5.2.1 | 1.5.2 | Jun 30, 2015 | 1.5.2.0 | 1.5.2 | Apr 27, 2015 | 1.5.1.0 | 1.5.1 | Apr 23, 2015 | 1.5.0.0 | 1.5.0 | Mar 31, 2015 | 1.4.4.0 | 1.4.4 | Apr 26, 2015 | 1.4.0.6 | 1.4.0 | Feb 23, 2015 | 1.4.0.5 | 1.4.0 | Jan 28, 2015 | 1.4.0.4 | 1.4.0 | Jan 19, 2015 | 1.4.0.3 | 1.4.0 | Dec 16, 2014 | 1.4.0.1 | 1.4.0 | Nov 10, 2014 |===
or
Do not forget to restart the node after installing.
Do not forget to restart the node after installing.
Do not forget to restart the node after installing.
Hyphen analyzer
https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/docs/asciidoc/hyphen.adoc
ICU
https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/docs/asciidoc/icu.adoc
Langdetect
https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/docs/asciidoc/langdetect.adoc
Standardnumber
More to come.
The german_normalizer
is equivalent to Elasticsearch german_normalization
. It performs umlaut treatment
with vocal expansion which is typical for german language.
PUT /test { "settings": { "index": { "analysis": { "filter": { "umlaut": { "type": "german_normalize" } }, "analyzer": { "umlaut": { "type": "custom", "tokenizer": "standard", "filter": [ "umlaut", "lowercase" ] } } } } }, "mappings": { "docs": { "properties": { "text": { "type": "text", "analyzer": "umlaut" } } } } }
GET /test/docs/_mapping
PUT /test/docs/1 { "text" : "Jörg Prante" }
POST /test/docs/_search?explain { "query": { "match": { "text": "Jörg" } } }
POST /test/docs/_search?explain { "query": { "match": { "text": "joerg" } } }
The plugin contains an extended version of the Lucene ICU functionality with a dependancy on ICU 58.2
Available are icu_collation
, icu_folding
, icu_tokenizer
, icu_numberformat
, icu_transform
The icu_collation
analyzer can apply rbbi ICU rule files on a field.
PUT /test { "settings": { "index": { "analysis": { "analyzer": { "icu_german_collate": { "type": "icu_collation", "language": "de", "country": "DE", "strength": "primary", "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü" }, "icu_german_collate_without_punct": { "type": "icu_collation", "language": "de", "country": "DE", "strength": "quaternary", "alternate": "shifted", "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü" } } } } }, "mappings": { "docs": { "properties": { "text": { "type": "text", "fielddata" : true, "analyzer": "icu_german_collate" }, "catalog_text" : { "type": "text", "fielddata" : true, "analyzer": "icu_german_collate_without_punct" } } } } }
GET /test/docs/_mapping
PUT /test/docs/1 { "text" : "Göbel", "catalog_text" : "Göbel" }
PUT /test/docs/2 { "text" : "Goethe", "catalog_text" : "G-oethe" }
PUT /test/docs/3 { "text" : "Goldmann", "catalog_text" : "Gold*mann" }
PUT /test/docs/4 { "text" : "Göthe", "catalog_text" : "Göthe" }
PUT /test/docs/5 { "text" : "Götz", "catalog_text" : "Götz" }
POST /test/docs/_search { "query": { "match_all": { } }, "sort" : { "text" : { "order" : "asc" } } }
The icu_folding
character filter folds characters in strings according to Unicode folding rules.
UTR#30 is retracted, but still used here.
PUT /test { "settings": { "index":{ "analysis":{ "char_filter" : { "my_icu_folder" : { "type" : "icu_folding" } }, "tokenizer" : { "my_icu_tokenizer" : { "type" : "icu_tokenizer" } }, "filter" : { "my_icu_folder_filter" : { "type" : "icu_folding" }, "my_icu_folder_filter_with_exceptions" : { "type" : "icu_folding", "name" : "utr30", "unicodeSetFilter" : "[^åäöÅÄÖ]" } }, "analyzer" : { "my_icu_analyzer" : { "type" : "custom", "tokenizer" : "my_icu_tokenizer", "filter" : [ "my_icu_folder_filter" ] }, "my_icu_analyzer_with_exceptions" : { "type" : "custom", "tokenizer" : "my_icu_tokenizer", "filter" : [ "my_icu_folder_filter_with_exceptions" ] } } } } }, "mappings": { "docs": { "properties": { "text": { "type": "text", "fielddata" : true, "analyzer": "my_icu_analyzer" }, "text2" : { "type": "text", "fielddata" : true, "analyzer": "my_icu_analyzer_with_exceptions" } } } } }
GET /test/docs/_mapping
PUT /test/docs/1 { "text" : "Jörg Prante", "text2" : "Jörg Prante" }
POST /test/docs/_search { "query": { "match": { "text" : "jörg" } } }
POST /test/docs/_search { "query": { "match": { "text" : "jorg" } } }
POST /test/docs/_search { "query": { "match": { "text2" : "jörg" } } }
// no hit
The icu_tokenizer
can use rules from file. Here, we set up rules to prevent tokenization of words with hyphen.
PUT /test { "settings": { "index": { "analysis": { "tokenizer": { "my_hyphen_icu_tokenizer": { "type": "icu_tokenizer", "rulefiles": "Latn:icu/Latin-dont-break-on-hyphens.rbbi" } }, "analyzer" : { "my_icu_analyzer" : { "type" : "custom", "tokenizer" : "my_hyphen_icu_tokenizer" } } } } }, "mappings": { "docs": { "properties": { "text": { "type": "text", "analyzer": "my_icu_analyzer" } } } } }
GET /test/docs/_mapping
PUT /test/docs/1 { "text" : "we do-not-break on hyphens" }
With the icu_numberformat
filter, you can index numbers as they are spelled out in a language.
PUT /test { "settings": { "index":{ "analysis":{ "filter" : { "spellout_de" : { "type" : "icu_numberformat", "locale" : "de", "format" : "spellout" } }, "analyzer" : { "my_icu_analyzer" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "spellout_de" ] } } } } }, "mappings": { "docs": { "properties": { "text": { "type": "text", "analyzer": "my_icu_analyzer" } } } } }
GET /test/docs/_mapping
PUT /test/docs/1 { "text" : "Das sind 1000 Bücher" }
{
"index":{
"analysis":{
"filter":{
"baseform":{
"type" : "baseform",
"language" : "de"
}
},
"tokenizer" : {
"baseform" : {
"type" : "standard",
"filter" : [ "baseform", "unique" ]
}
}
}
}
}
{
"index":{
"analysis":{
"filter" : {
"wd" : {
"type" : "worddelimiter2",
"generate_word_parts" : true,
"generate_number_parts" : true,
"catenate_all" : true,
"split_on_case_change" : true,
"split_on_numerics" : true,
"stem_english_possessive" : true
}
}
}
}
}
This is an implementation of a word decompounder plugin for link:http://github.com/elasticsearch/elasticsearch[Elasticsearch].
Compounding several words into one word is a property not all languages share. Compounding is used in German, Scandinavian Languages, Finnish and Korean.
This code is a reworked implementation of the link:http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm[Baseforms Tool] found in the http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm[ASV toolbox] of http://asv.informatik.uni-leipzig.de/staff/Chris_Biemann[Chris Biemann], Automatische Sprachverarbeitung of Leipzig University.
Lucene comes with two compound word token filters, a dictionary- and a hyphenation-based variant. Both of them have a disadvantage, they require loading a word list in memory before they run. This decompounder does not require word lists, it can process german language text out of the box. The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation provided by the ASV toolbox.
In the mapping, use a token filter of type "decompound"::
{ "index":{ "analysis":{ "filter":{ "decomp":{ "type" : "decompound" } }, "tokenizer" : { "decomp" : { "type" : "standard", "filter" : [ "decomp" ] } } } } }
"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"
It is recommended to add the Unique token filter <http://www.elasticsearch.org/guide/reference/index-modules/analysis/unique-tokenfilter.html>
_ to skip tokens that occur more than once.
Also the Lucene german normalization token filter is provided::
{
"index":{
"analysis":{
"filter":{
"umlaut":{
"type":"german_normalize"
}
},
"tokenizer" : {
"umlaut" : {
"type":"standard",
"filter" : "umlaut"
}
}
}
}
}
The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".
The decomposing algorithm knows about a threshold when to assume words as decomposed successfully or not. If the threshold is too low, words could silently disappear from being indexed. In this case, you have to adapt the threshold so words do no longer disappear.
The default threshold value is 0.51. You can modify it in the settings::
{
"index" : {
"analysis" : {
"filter" : {
"decomp" : {
"type" : "decompound",
"threshold" : 0.51
}
},
"tokenizer" : {
"decomp" : {
"type" : "standard",
"filter" : [ "decomp" ]
}
}
}
}
}
Sometimes only the decomposed subwords should be indexed. For this, you can use the parameter "subwords_only": true
{
"index" : {
"analysis" : {
"filter" : {
"decomp" : {
"type" : "decompound",
"subwords_only" : true
}
},
"tokenizer" : {
"decomp" : {
"type" : "standard",
"filter" : [ "decomp" ]
}
}
}
}
}
The time consumed by the decompound computation may increase your overall indexing time drastically if applied in the billions. You can configure a least-frequently-used cache for mapping a token to the decompounded tokens with the following settings:
use_cache: true
- enables caching
cache_size
- sets cache size, default: 100000
cache_eviction_factor
- sets cache eviction factor, valida values are between 0.00 and 1.00, default: 0.90
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"decomp":{
"type" : "decompound",
"use_payload": true,
"use_cache": true
}
},
"analyzer": {
"decomp": {
"type": "custom",
"tokenizer" : "standard",
"filter" : [
"decomp",
"lowercase"
]
},
"lowercase": {
"type": "custom",
"tokenizer" : "standard",
"filter" : [
"lowercase"
]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "decomp",
"search_analyzer": "lowercase"
}
}
}
}
}
The usage of decompounds can lead to undesired results regarding phrase queries.
After indexing, decompound tokens ca not be distinguished from original tokens.
The outcome of a phrase query "Deutsche Bank" could be Deutsche Spielbankgesellschaft
,
what is clearly an unexpected result. To enable "exact" phrase queries, each decoumpound token is
tagged with additional payload data.
To evaluate this payload data, you can use the exact_phrase
as a wrapper around a query
containing your phrase queries.
use_payload
- if set to true, enable payload creation. Default: false
{
"query": {
"exact_phrase": {
"query": {
"query_string": {
"query": "\"deutsche bank\"",
"fields": [
"message"
]
}
}
}
}
}
curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test'
curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
"article" : {
"properties" : {
"content" : { "type" : "langdetect" }
}
}
}
'
curl -XPUT 'localhost:9200/test/article/1' -d '
{
"title" : "Some title",
"content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}
'
curl -XPUT 'localhost:9200/test/article/2' -d '
{
"title" : "Ein Titel",
"content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}
'
curl -XPUT 'localhost:9200/test/article/3' -d '
{
"title" : "Un titre",
"content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}
'
curl -XGET 'localhost:9200/test/_refresh'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "en"
}
}
}
'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "de"
}
}
}
'
curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "fr"
}
}
}
'
{
"index" : {
"analysis" : {
"filter" : {
"standardnumber" : {
"type" : "standardnumber"
}
},
"analyzer" : {
"standardnumber" : {
"tokenizer" : "whitespace",
"filter" : [ "standardnumber", "unique" ]
}
}
}
}
}
WordDelimiterFilter2: taken from Lucene
baseform: index also base forms of words (german, english)
decompound: decompose words if possible (german)
langdetect: find language code of detected languages
standardnumber: standard number entity recognition
hyphen: token filter for shingling and combining hyphenated words (german: Bindestrichwörter), the opposite of the decompound token filter
sortform: process string forms for bibliographical sorting, taking non-sort areas into account
year: token filter for 4-digit sequences
reference:
{
"someType" : {
"_source" : {
"enabled": false
},
"properties" : {
"someField":{ "type" : "crypt", "algo": "SHA-512" }
}
}
}
All feedback is welcome! If you find issues, please post them at Github
The decompunder is a derived work of ASV toolbox http://asv.informatik.uni-leipzig.de/asv/methoden
Copyright (C) 2005 Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig
The Compact Patricia Trie data structure can be found in
Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of ACM, 1968, 15(4):514–534
The compound splitter used for generating features for document classification is described in
Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland
The base form reduction step (for Norwegian) is described in
Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, 2006, Finland
elasticsearch-plugin-bundle - a compilation of useful plugins for Elasticsearch
Copyright (C) 2014 Jörg Prante
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.