jolicode / emoji-search

:smile: Emoji synonyms to build your own emoji-capable search engine (elasticsearch, solr, OpenSearch)
https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji
MIT License
218 stars 64 forks

cldr-emoji-annotation-synonyms-en.txt: Terms "completely eliminated by analyzer" #27

Closed: vicchi closed this issue 4 years ago

vicchi commented 4 years ago

Hi @damienalexandre ... Thank you for continuing to collate and update the synonyms files ...

Related to #26, I've used your example mappings with one slight change (placing the synonyms file in /etc/elasticsearch/synonyms instead) as follows:

  1. Installed the analysis-icu plugin and restarted Elasticsearch
  2. Copied the synonyms file with sudo cp synonyms/cldr-emoji-annotation-synonyms-en.txt /etc/elasticsearch/synonyms/

Then ...

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "synonyms/cldr-emoji-annotation-synonyms-en.txt" 
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}

which gives me ...

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "failed to build synonyms"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "failed to build synonyms",
    "caused_by" : {
      "type" : "parse_exception",
      "reason" : "Invalid synonym rule at line 859",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "term: * was completely eliminated by analyzer"
      }
    }
  },
  "status" : 400
}

Line 859 is the first rule that contains a non-alphabetic synonym, in this case:

✨ => ✨, *, sparkle, sparkles, star

Removing the * from the definition works but then the same issue recurs from line 1257 (✅ => ✅, ✓, button, check, mark) onwards.
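To find every affected rule in one pass rather than one error at a time, a rough scan of the file can help. The sketch below is only a heuristic and not part of the repository: the function name suspicious_terms is made up for illustration, and it assumes that any right-hand term which is not the emoji itself and contains no letter or digit is a candidate for being dropped by the analyzer.

```python
def suspicious_terms(lines):
    """Return (line_number, term) pairs for right-hand terms that contain
    no letter or digit and are not the emoji on the left-hand side.

    Heuristic only: it does not run the real analyzer, it just points at
    terms such as "*" that look likely to be eliminated.
    """
    found = []
    for number, line in enumerate(lines, start=1):
        rule = line.strip()
        # Skip blank lines, comments, and anything that is not an explicit mapping.
        if not rule or rule.startswith("#") or "=>" not in rule:
            continue
        left, right = rule.split("=>", 1)
        source = left.strip()
        for term in (t.strip() for t in right.split(",")):
            if term and term != source and not any(ch.isalnum() for ch in term):
                found.append((number, term))
    return found
```

It can be fed the lines of the synonyms file, for example from open(path, encoding="utf-8"), and the reported line numbers then match the ones in the Elasticsearch error.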

This is on Elasticsearch 7.8.0 on Ubuntu 20.04.

Is this a problem with the synonyms file or am I missing something very obvious?

damienalexandre commented 4 years ago

Hi!

It seems there is an issue with the synonym files and ICU. This error happens when the tokenizer completely removes a string, and * and ✓ are neither emoji nor "text".
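One way to check this for a single term is the Elasticsearch _analyze API: if the returned tokens list is empty, the tokenizer produced nothing for that input. The following is a minimal sketch, assuming a node reachable at http://localhost:9200 with the analysis-icu plugin installed. The helper names are hypothetical, and only the request body and the tokens field come from the _analyze API itself.

```python
import json
from urllib import request


def analyze_body(text, tokenizer="icu_tokenizer"):
    """Build the JSON body for a POST to /_analyze."""
    return {"tokenizer": tokenizer, "text": text}


def is_eliminated(response):
    """True when the _analyze response contains no tokens at all."""
    return len(response.get("tokens", [])) == 0


def analyze(text, base_url="http://localhost:9200"):
    """Send the term to a running node (the URL is an assumption)."""
    req = request.Request(
        base_url + "/_analyze",
        data=json.dumps(analyze_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Against a live node, is_eliminated(analyze("*")) would be expected to be true if the tokenizer drops the asterisk, which is consistent with the error reported above.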

Thanks for letting us know. I will work on a patch, and I also want to add tests (see #12) to avoid issues like this with future Elasticsearch releases.

vicchi commented 4 years ago

@damienalexandre Thank you!

damienalexandre commented 4 years ago

This is resolved in https://github.com/jolicode/emoji-search/commit/e5309a88cf25d7a6e3c81568af4c7509b6012442. I fixed all the files in all languages and added automated tests that check them on each change.

Thanks for reporting the issue!

Cheers

vicchi commented 4 years ago

@damienalexandre I only just got around to testing this today and, so far, the signs are good. Thanks once again.