jolicode / emoji-search

:smile: Emoji synonyms to build your own emoji-capable search engine (elasticsearch, solr, OpenSearch)
https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji
MIT License

🙂 Emoji, flags & emoticons support for Elasticsearch

Add support for emoji and flags in any Lucene compatible search engine!

If you wish to search 🍩 to find donuts in your documents, you came to the right place. We offer synonym files ready to use in Elasticsearch and OpenSearch analyzers.


Requirements to index emoji in Elasticsearch

There are no requirements for Elasticsearch >= 6.7.

Using an older version of Elasticsearch? Open me! 🖱

| Version | Requirements |
|--------------------------------|:------------:|
| Elasticsearch >= 6.4 and < 6.7 | You need to install the official [ICU Plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html). See our [blog post about this change](https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji). |
| Elasticsearch < 6.4 | You need our [custom ICU Tokenizer Plugin](https://github.com/jolicode/emoji-search/tree/6.2.4/esplugin), see our [blog post](http://jolicode.com/blog/search-for-emoji-with-elasticsearch) (2016). |

Run the following test to verify that you get 4 EMOJI tokens:

```json
GET _analyze
{
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
```

The Synonyms, flags and emoticons

What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.

We build Solr / Lucene compatible synonym files for every language supported by Unicode CLDR, so you can set them up in an analyzer. A file looks like this:

👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
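Each rule uses the explicit `=>` mapping form: the emoji on the left is replaced by everything on the right, and because the emoji is repeated on the right it stays in the token stream next to its keywords. To see a rule in action without installing any file, it can be given inline. This is a minimal sketch, assuming Elasticsearch >= 6.7. The index, filter and analyzer names (`emoji_rule_demo`, `demo_emoji`, `demo_analyzer`) are hypothetical, the two rules are copied from the sample above, and the lines starting with `#` are Console comments to drop when using another client:

```json
# Hypothetical throwaway index with two rules given inline
PUT /emoji_rule_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "demo_emoji": {
          "type": "synonym",
          "synonyms": [
            "🥓 => 🥓, bacon, meat, food",
            "🇫🇷 => 🇫🇷, france"
          ]
        }
      },
      "analyzer": {
        "demo_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "demo_emoji"]
        }
      }
    }
  }
}

# The sample text is made up for this sketch
GET /emoji_rule_demo/_analyze
{
  "analyzer": "demo_analyzer",
  "text": "eggs and 🥓"
}
```

The response should list the emoji together with `bacon`, `meat` and `food` at the same position, which is what lets a search for any of those words match a document containing only the emoji.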

For emoticons, use this mapping with a char_filter to replace emoticons by emoji.
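As a rough illustration, the emoticon mapping can be plugged in through a `mapping` char filter that reads the file with `mappings_path`. This sketch assumes the files are stored as described in the Installation section. The index, char filter and analyzer names (`emoticon_demo`, `emoticons_char_filter`, `emoticon_aware`) are hypothetical, and the `#` lines are Console comments:

```json
# Hypothetical index showing only the emoticon-related parts
PUT /emoticon_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons_char_filter": {
          "type": "mapping",
          "mappings_path": "analysis/emoticons.txt"
        }
      },
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "emoticon_aware": {
          "char_filter": ["emoticons_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_emoji"]
        }
      }
    }
  }
}

# The sample text is made up for this sketch
GET /emoticon_demo/_analyze
{
  "analyzer": "emoticon_aware",
  "text": "good morning :)"
}
```

Char filters run before the tokenizer, so an emoticon listed in the mapping file is rewritten to its emoji first and can then be expanded by the synonym filter like any other emoji.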

Installation

Download the emoji and emoticon files you want from this repository and store them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).

config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...

Use them like this (this is a complete English example for Elasticsearch >= 6.7):

PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}

You can now test the result with:

GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
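Once the analyzer returns the expected tokens, an end-to-end check is to index a document and search it with an emoji only. This is a sketch against the `tweets` index created above, assuming Elasticsearch 7 or later (typeless `_doc` endpoint) and assuming the English synonym file contains the 🥓 rule shown earlier. The document text is made up, and the `#` lines are Console comments:

```json
# Index one sample document and make it searchable immediately
PUT /tweets/_doc/1?refresh=true
{
  "content": "Crispy bacon for breakfast"
}

# Search with an emoji only
GET /tweets/_search
{
  "query": {
    "match": {
      "content": "🥓"
    }
  }
}
```

The `match` query runs the search text through `english_with_emoji`, so 🥓 is expanded to its keywords and the document should be returned through the `bacon` term.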

How to contribute

Build from CLDR SVN

You will need:

Edit the tag in tools/build-released.php and run php tools/build-released.php.

Update emoticons

Run php tools/build-emoticon.php.

Licenses

Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).

This repository is distributed under the MIT License. Feel free to use and contribute as you please!