This plugin has been superseded by the completion suggester in Elasticsearch and is no longer developed. There is an excellent introductory blog post available as well.
Development of this plugin stopped with the release for Elasticsearch 1.3, which you should not use anymore!
Note: If you only need prefix suggestions, please use the new completion suggest feature available since elasticsearch 0.90.3, which features blazing fast real time suggestions, uses the AnalyzingSuggester under the hood and will also support fuzzy mode in 0.90.4.
This plugin uses the FSTSuggester, the AnalyzingSuggester or the FuzzySuggester from Lucene to create suggestions from a certain field for a specified term instead of returning the whole document data.
Feel free to comment, improve and help. I am thankful for any insights, whether they concern elasticsearch, lucene or any other mistakes I have surely made.
Oh, and in case you have not read it above: this plugin is no longer developed, so please use the completion suggester instead.
In case you want to contact me, drop me a mail at alexander@reelsen.net
Because elasticsearch now comes with its own suggest API (not based on in-memory automatons per shard), large parts of this plugin need to be changed.
Both REST endpoints have been moved. The /_suggest endpoint now resides at __suggest. Refreshing has changed from _suggestRefresh to __suggestRefresh.
I do not like this renaming either, but I have not yet had a better idea, and I am totally open to better names. This is a WIP until elasticsearch 1.0 is released.
Everything is now in the de.spinscale package namespace in order to avoid clashes. This means that if you are using the request builder classes, you will have to change your application.
If you do not want to work on the repository, just use the standard elasticsearch plugin command (inside your elasticsearch/bin directory)
bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/0.90.5-0.9
Note: Please make sure the plugin version matches your elasticsearch version. Follow this compatibility matrix:
-------------------------------------
| suggest plugin | Elasticsearch    |
-------------------------------------
| 1.3.2-2.0.1    | 1.3.2 -> master  |
| 1.0.1-2.0.0    | 1.0.1            |
| 0.90.12-1.1    | 0.90.12          |
| 0.90.7-1.0     | 0.90.7           |
| 0.90.5-0.9     | 0.90.5           |
| 0.90.3-0.8.*   | 0.90.3           |
| 0.90.1-0.7     | 0.90.1           |
| 0.90.0-0.6.*   | 0.90.0           |
| 0.20.5-0.5     | 0.20.5 -> 0.20.6 |
| 0.20.2-0.4     | 0.20.2 -> 0.20.4 |
| 0.19.12-0.2    | 0.19.12          |
| 0.19.11-0.1    | 0.19.11          |
-------------------------------------
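If you install the plugin from a script, the matrix above can be turned into a small lookup. The following is only a sketch: the function names are made up for illustration, and it treats "-> master" as "no upper bound", which is an assumption and only reflects the state of the table when it was written.

```python
# Compatibility matrix copied from the table above:
# (plugin version, lowest Elasticsearch version, highest Elasticsearch version).
# None as the upper bound stands for "-> master" (assumed to mean open-ended).
MATRIX = [
    ("1.3.2-2.0.1", "1.3.2", None),
    ("1.0.1-2.0.0", "1.0.1", "1.0.1"),
    ("0.90.12-1.1", "0.90.12", "0.90.12"),
    ("0.90.7-1.0", "0.90.7", "0.90.7"),
    ("0.90.5-0.9", "0.90.5", "0.90.5"),
    ("0.90.3-0.8.*", "0.90.3", "0.90.3"),
    ("0.90.1-0.7", "0.90.1", "0.90.1"),
    ("0.90.0-0.6.*", "0.90.0", "0.90.0"),
    ("0.20.5-0.5", "0.20.5", "0.20.6"),
    ("0.20.2-0.4", "0.20.2", "0.20.4"),
    ("0.19.12-0.2", "0.19.12", "0.19.12"),
    ("0.19.11-0.1", "0.19.11", "0.19.11"),
]


def parse_version(version):
    """Turn '0.90.5' into (0, 90, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def plugin_for(es_version):
    """Return the plugin version listed for an Elasticsearch version, or None."""
    wanted = parse_version(es_version)
    for plugin, low, high in MATRIX:
        if wanted < parse_version(low):
            continue
        if high is None or wanted <= parse_version(high):
            return plugin
    return None


print(plugin_for("0.90.5"))
print(plugin_for("0.20.3"))
```

A result of None means the table lists no matching plugin release, in which case building from a suitable tag is the remaining option.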
If you want to work on the repository:
Clone this repo: git clone git://github.com/spinscale/elasticsearch-suggest-plugin.git
Check out the tag you want to build with (list the tags with git tag); master may not match your elasticsearch version.
Build the plugin: mvn clean package -DskipTests=true
This does not run any unit tests, as they take some time. If you want to run them, run mvn clean package instead.
Install the plugin: /path/to/elasticsearch/bin/plugin -install elasticsearch-suggest -url file:///$PWD/target/releases/elasticsearch-suggest-$version.zip
Alternatively, you can now use this plugin via maven and include it via the sonatype repo like this in your pom.xml (or any other dependency manager):
<repositories>
  <repository>
    <id>Sonatype</id>
    <name>Sonatype</name>
    <url>http://oss.sonatype.org/content/repositories/releases/</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>de.spinscale</groupId>
    <artifactId>elasticsearch-suggest-plugin</artifactId>
    <version>1.3.2-2.0.1</version>
  </dependency>
  ...
</dependencies>
The maven repo can be visited at https://oss.sonatype.org/content/repositories/releases/de/spinscale/elasticsearch-plugin-suggest/
Fire up curl like this, in case you have a products index and the right fields. If not, read below how to set up a clean elasticsearch in order to support suggestions.
# curl -X POST 'localhost:9200/products1/product/__suggest?pretty=1' -d '{ "field": "ProductName.suggest", "term": "tischwäsche", "size": "10" }'
{
"suggest" : [ "tischwäsche", "tischwäsche 100",
"tischwäsche aberdeen", "tischwäsche acryl", "tischwäsche ambiente",
"tischwäsche aquarius", "tischwäsche atlanta", "tischwäsche atlas",
"tischwäsche augsburg", "tischwäsche aus", "tischwäsche austria" ]
}
As you can see, this queries the products index for the field ProductName.suggest with the specified term and size.
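The same request can be assembled from a script. This is a minimal sketch using only the Python standard library; the helper name build_suggest_request and the default host are assumptions made for illustration, while the URL layout and the field, term and size keys come from the curl call above.

```python
import json
from urllib.request import Request


def build_suggest_request(index, doc_type, field, term, size=10, host="localhost:9200"):
    """Build (but do not send) a POST request for the __suggest endpoint."""
    body = {"field": field, "term": term, "size": size}
    url = "http://%s/%s/%s/__suggest" % (host, index, doc_type)
    return Request(url, data=json.dumps(body).encode("utf-8"), method="POST")


request = build_suggest_request("products1", "product", "ProductName.suggest", "tischwäsche")
print(request.get_method(), request.full_url)

# To send it to a running node with the plugin installed:
#   from urllib.request import urlopen
#   print(json.load(urlopen(request)))
```

Building the body with json.dumps avoids shell quoting problems with non-ASCII terms such as the one used here.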
You can also use HTTP GET for getting suggestions, even with the callback and the source parameters, like in any normal elasticsearch search.
You might want to check out the included unit tests as well. I use a shingle filter in my examples; take a look at the files in the src/test/resources directory.
With Lucene 4 (and the upgrade to elasticsearch 0.90.0) two new suggesters were added: the AnalyzingSuggester and the FuzzySuggester, which is based on the first one. Both have the great capability of returning the original form while searching on an analyzed one. Take this example (notice the search for a lowercase b, but getting back the original field content):
» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "b", "analyzer" : "standard" }'
{"suggestions":["BMW 320","BMW 525d"],"_shards":{"total":5,"successful":5,"failed":0}}
Note: If you use type full or type fuzzy, the similarity parameter will not have any effect. In addition, these parameters are supported only for full and fuzzy: analyzer, index_analyzer, search_analyzer.
This suggester can even ignore stopwords if configured appropriately, but only if you disable position increments for stopwords. Use this mapping and index settings when creating an index:
curl -X DELETE localhost:9200/cars
curl -X PUT localhost:9200/cars -d '{
  "mappings" : {
    "car" : {
      "properties" : {
        "name" : {
          "type" : "multi_field",
          "fields" : {
            "name": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "suggest_analyzer_stopwords" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "stopword_no_position_increment" ]
        },
        "suggest_analyzer_synonyms" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "my_synonyms" ]
        }
      },
      "filter" : {
        "stopword_no_position_increment" : {
          "type" : "stop",
          "enable_position_increments" : false
        },
        "my_synonyms" : {
          "type" : "synonym",
          "synonyms" : [ "jetta, bora" ]
        }
      }
    }
  }
}'
curl -X POST localhost:9200/cars/car -d '{ "name" : "The BMW ever" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "BMW 320" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "BMW 525d" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "VW Jetta" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "VW Bora" }'
Now when querying with a stopwords analyzer, you can even get back The BMW ever:
» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "b", "analyzer" : "suggest_analyzer_stopwords" }'
{"suggestions":["BMW 320","BMW 525d","The BMW ever"],"_shards":{"total":5,"successful":5,"failed":0}}
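The idea behind both queries (match on the analyzed form, return the original form) can be illustrated with a toy model. This sketch is not how Lucene implements it: the real suggesters use an in-memory automaton, while this uses a linear scan, and the analyze function is a hypothetical stand-in that only lowercases, splits on whitespace and drops stopwords.

```python
CARS = ["The BMW ever", "BMW 320", "BMW 525d", "VW Jetta", "VW Bora"]


def analyze(text, stopwords=frozenset()):
    """Stand-in analyzer: lowercase, split on whitespace, drop stopwords."""
    return " ".join(t for t in text.lower().split() if t not in stopwords)


def suggest(term, names=CARS, stopwords=frozenset()):
    """Return original names whose analyzed form starts with the analyzed term."""
    prefix = analyze(term, stopwords)
    return sorted(n for n in names if analyze(n, stopwords).startswith(prefix))


print(suggest("b"))
print(suggest("b", stopwords=frozenset({"the"})))
```

With the stopword removed before matching, the analyzed form of The BMW ever starts with b, which is why it shows up in the second result but not in the first.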
Or you could use synonyms (FYI: the Jetta and the Bora are the same car, but named differently in the USA and Europe, so a search should return both):
» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "vw je", "analyzer" : "suggest_analyzer_synonyms" }'
{"suggestions":["VW Bora","VW Jetta"],"_shards":{"total":5,"successful":5,"failed":0}}
The FuzzySuggester uses the Levenshtein distance to cater for typos:
» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "fuzzy", "term" : "bwm", "analyzer" : "standard" }'
{"suggestions":["BMW 320","BMW 525d"],"_shards":{"total":5,"successful":5,"failed":0}}
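To see why bwm is close to bmw, it helps to compute the edit distance by hand. The sketch below is a plain dynamic-programming implementation, not Lucene's automaton-based one, and the transpositions flag is an illustrative assumption: this README does not state whether the plugin counts a swap of two adjacent letters as one edit or as two.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count an adjacent swap as a single edit."""
    # d[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(a)][len(b)]


print(edit_distance("bwm", "bmw"))
print(edit_distance("bwm", "bmw", transpositions=True))
```

Under the plain definition the typo costs two substitutions; when swaps are counted as one edit it costs a single edit.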
The FuzzySuggester and the AnalyzingSuggester contain a method to find out their size, which is also exposed as its own endpoint, in case you want to monitor memory consumption of the in-memory structures.
» curl localhost:9200/__suggestStatistics
{"_shards":{"total":2,"successful":2,"failed":0},"fstStats":{"cars-0":[{"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_synonyms-indexAnalyzer:suggest_analyzer_synonyms":147},{"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_stopwords-indexAnalyzer:suggest_analyzer_stopwords":126}]}}
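For monitoring, the per-suggester numbers can be summed per shard. This is a small sketch over the sample response shown above; the function name is made up, and it assumes the response keeps the fstStats layout of a shard name mapped to a list of single-entry objects.

```python
import json

# Sample response copied from the __suggestStatistics call above.
response = json.loads('''
{"_shards":{"total":2,"successful":2,"failed":0},
 "fstStats":{"cars-0":[
   {"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_synonyms-indexAnalyzer:suggest_analyzer_synonyms":147},
   {"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_stopwords-indexAnalyzer:suggest_analyzer_stopwords":126}]}}
''')


def size_per_shard(stats):
    """Sum the reported suggester sizes for every shard in fstStats."""
    totals = {}
    for shard, suggesters in stats["fstStats"].items():
        totals[shard] = sum(size for entry in suggesters for size in entry.values())
    return totals


print(size_per_shard(response))
```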
Furthermore, the suggest data is not updated whenever you index a new product, but only periodically. The default is to update the suggest index every 10 minutes, but you can change that in your elasticsearch.yml configuration:
suggest:
  refresh_interval: 600s
In this case the suggest indexes are refreshed every 10 minutes, which is also the default. You can use values like "10s", "10ms" or "10m", as with most other time-based configuration settings in elasticsearch.
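To make the three example values concrete, here is a hypothetical helper that converts them to milliseconds. It is only an illustration of the notation: elasticsearch does its own parsing and accepts more units than the three handled here.

```python
import re

# Only the units that appear in the examples above are handled in this sketch.
UNIT_IN_MILLIS = {"ms": 1, "s": 1000, "m": 60 * 1000}


def to_millis(value):
    """Convert a time value such as '600s' or '10m' into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m)", value)
    if match is None:
        raise ValueError("unsupported time value: %r" % value)
    number, unit = match.groups()
    return int(number) * UNIT_IN_MILLIS[unit]


print(to_millis("600s") == to_millis("10m"))
```

The comparison shows that 600s and 10m describe the same interval, the 10 minute default.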
If you want to deactivate automatic refresh completely, put this in your elasticsearch configuration:
suggest:
  refresh_disabled: true
If you want to refresh your FST suggesters manually instead of waiting for 10 minutes, just issue a POST request to the /__suggestRefresh URL.
# curl -X POST 'localhost:9200/__suggestRefresh'
# curl -X POST 'localhost:9200/products/product/__suggestRefresh'
# curl -X POST 'localhost:9200/products/product/__suggestRefresh' -d '{ "field" : "ProductName.suggest" }'
SuggestRequest request = new SuggestRequest(index);
request.term(term);
request.field(field);
request.size(size);
request.similarity(similarity);
SuggestResponse response = node.client().execute(SuggestAction.INSTANCE, request).actionGet();
Refresh works like this - you can add an index and a field in the suggest refresh request as well, if you want to trigger it externally:
SuggestRefreshRequest refreshRequest = new SuggestRefreshRequest();
SuggestRefreshResponse response = node.client().execute(SuggestRefreshAction.INSTANCE, refreshRequest).actionGet();
You can also use the included builders
List<String> suggestions = new SuggestRequestBuilder(client)
.field(field)
.term(term)
.size(size)
.similarity(similarity)
.execute().actionGet().suggestions();
SuggestRefreshRequestBuilder builder = new SuggestRefreshRequestBuilder(client);
builder.execute().actionGet();
$index/_suggest and $index/_suggestRefresh urls
suggest.refresh_disabled = true in order to deactivate automatic refreshing of the suggest index
SuggestRequestBuilder and SuggestRefreshRequestBuilder classes - results in easy to use request classes (check the examples and tests)
This HOWTO will help you set up a clean elasticsearch installation with the correct index settings and mappings, so you can use the plugin as easily as possible. We will set up elasticsearch, index some products and query those for suggestions.
Get elasticsearch, install it, get this plugin, install it.
Add a suggest and a lowercase analyzer to your elasticsearch/config/elasticsearch.yml config file (or do it on index creation, whichever you like):
index:
  analysis:
    analyzer:
      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase]
      suggest_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, shingle]
Start elasticsearch and create a mapping. You can either create it via configuration in a file or during index creation. We will create an index with a mapping now
curl -X PUT localhost:9200/products -d '{
  "mappings" : {
    "product" : {
      "properties" : {
        "ProductId": { "type": "string", "index": "not_analyzed" },
        "ProductName" : {
          "type" : "multi_field",
          "fields" : {
            "ProductName": { "type": "string", "index": "not_analyzed" },
            "lowercase": { "type": "string", "analyzer": "lowercase_analyzer" },
            "suggest" : { "type": "string", "analyzer": "suggest_analyzer" }
          }
        }
      }
    }
  }
}'
Let's add some products:
for i in 1 2 3 4 5 6 7 8 9 10 100 101 1000; do
  # printf arguments are separated by spaces; commas would end up inside the JSON
  json=$(printf '{"ProductId": "%s", "ProductName": "%s" }' "$i" "My Product $i")
  curl -X PUT localhost:9200/products/product/$i -d "$json"
done
Time to query and understand the different analyzers
Queries the not analyzed field, returns 10 matches (default), always the full product name:
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "My" }'
Queries the not analyzed field, returns nothing (because lowercase):
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "my" }'
Queries the lowercase field, returns only the occurring word (which is pretty bad for suggestions):
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.lowercase", "term": "m" }'
Queries the suggest field, returns two words (this is the default length of the shingle filter), in this case "my" and "my product"
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "my" }'
Queries the suggest field, returns ten product names as we started with the second word + another one due to the shingle
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product" }'
Queries the suggest field, returns all products with "product 1" in the shingle
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1" }'
The same query as above, but limits the result set to two
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1", "size": 2 }'
And last but not least, typo finding. The same query without the similarity parameter returns nothing:
curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "proudct", "similarity": 0.7 }'
The similarity is a float between 0.0 and 1.0. If it is not specified, 1.0 is used, which means the term must match exactly. I have found 0.7 to be OK for cases where two letters were swapped, but your mileage may vary, as I tested merely on German product names.
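One way to get a feeling for the 0.7 threshold is to compute a normalized Levenshtein score, that is 1 minus the edit distance divided by the length of the longer string. Whether the plugin uses exactly this measure is an assumption; the README only says that the value lies between 0.0 and 1.0 and that 1.0 means an exact match.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed measure: 1.0 for identical strings, lower for more edits."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(round(similarity("proudct", "product"), 3))
```

Under this assumption the swapped letters in proudct count as two edits on a seven letter word, which lands just above 0.7, while a shorter word with the same typo would fall below it.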
In the tests I did, a shingle filter yielded the best results. Please check http://www.elasticsearch.org/guide/reference/index-modules/analysis/shingle-tokenfilter.html for more information about setup, like the default tokenization of two terms.
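What the suggest_analyzer from this HOWTO produces can be approximated in a few lines. This is a rough sketch, not the real analysis chain: it splits on whitespace instead of using the standard tokenizer, and it assumes that single words are emitted next to the two-word shingles, which matches the "my" and "my product" results described above.

```python
def shingles(text, max_size=2):
    """Lowercase the text and emit every run of 1..max_size adjacent words."""
    tokens = text.lower().split()
    result = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


print(shingles("My Product 1"))
```

Each shingle becomes a term in the ProductName.suggest field, which is why a term like product 1 can be completed even though it starts in the middle of the product name.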
Now test with your own data and improve on this configuration. I am happy to hear about your specific configuration for successful suggestion queries.