Add support for emoji and flags in any Lucene compatible search engine!
If you wish to search 🍩 to find donuts in your documents, you came to the right place. We offer synonym files ready for use in Elasticsearch and OpenSearch analyzers.
There are no specific requirements for Elasticsearch >= 6.7.
What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.
We build Solr / Lucene compatible synonym files in all languages supported by Unicode CLDR so you can set them up in an analyzer. It looks like this:
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
For emoticons, use the emoticon mapping (emoticons.txt) with a char_filter to replace emoticons with emoji.
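As a rough sketch of that char_filter setup, assuming emoticons.txt and the English synonym file are stored under config/analysis as in the installation step of this README: the index name tweets_emoticons and the names emoticons_char_filter and english_with_emoticons are hypothetical, while the mapping char filter type and its mappings_path setting are standard Elasticsearch.

```console
# Hypothetical index, char filter and analyzer names; file paths follow the layout used in this README.
PUT /tweets_emoticons
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons_char_filter": {
          "type": "mapping",
          "mappings_path": "analysis/emoticons.txt"
        }
      },
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoticons": {
          "char_filter": ["emoticons_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_emoji"]
        }
      }
    }
  }
}
```

Char filters run before the tokenizer, so an emoticon is first rewritten to its emoji and then expanded by the synonym filter like any other emoji.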
Download the emoji and emoticon files you want from this repository and store them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).
config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...
Use them like this (this is a complete English example with Elasticsearch >= 6.7):
PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}
You can now test the result with:
GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
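Beyond the analyzer output, an end-to-end check could look like the sketch below. The document ID and text are made up for illustration. The tweets index and content field come from the example above, and the match assumes the English synonym file lists "donut" for 🍩, as the introduction suggests.

```console
# Hypothetical sample document, indexed with an immediate refresh so it is searchable right away.
PUT /tweets/_doc/1?refresh=true
{
  "content": "Fresh donuts every morning"
}

# Searching with the emoji; the query text goes through the same english_with_emoji analyzer.
GET /tweets/_search
{
  "query": {
    "match": {
      "content": "🍩"
    }
  }
}
```

A match query analyzes the query text with the field's analyzer, so the emoji is expanded to its synonyms at search time just as it is at index time.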
To rebuild the files, you will need PHP available on the command line.
Edit the tag in tools/build-released.php and run php tools/build-released.php.
Run php tools/build-emoticon.php.
Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are done on the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).
This repository is distributed under the MIT License. Feel free to use and contribute as you please!