Plugin release for ElasticSearch 7.8

harmenjanssen commented 4 years ago

Hi @damienalexandre,

We're having the same issue you describe here, when using the dictionary file from this repository.

We've just upgraded from ElasticSearch 5.3 to 7.8 however, so we can't use your plugin to solve this issue (yet). Is a 7.8 release on your roadmap by any chance?

Thanks in advance!

damienalexandre commented 4 years ago

Hi!

There is no need for the plugin with Elasticsearch version >= 6.4 as the ICU library has been updated.

So with your 7.8 you just have to install the "analysis-icu" plugin (because you need to use icu_tokenizer) and use the dictionary as a synonym token filter.

Something like this:

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt" 
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}

damienalexandre commented 4 years ago

I suggest this blog post for more information: https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji

harmenjanssen commented 4 years ago

Hmm, then maybe my question is wrong, haha.

I got the following error when creating the index:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: \uD83C\uDFFB was completely eliminated by analyzer"}}},"status":400}

and assumed from that other thread I would need your plugin to fix this. But am I right in concluding the dictionary file can be used when I configure the ICU tokenizer?

damienalexandre commented 4 years ago

Yes, I suspect you didn't use ICU at all when you got this error?

harmenjanssen commented 4 years ago

That's true.

Thanks for getting back to me so quickly, I'm sure I can make it work. 🙂

harmenjanssen commented 4 years ago

We did make it work, eventually!

However, our client reported a strange bug in which the query "🍏☀️" would yield results, but "☀️🍏" would not.

Upon inspection, the first emoji is converted into synonyms, but the second one isn't:

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏☀️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "☀",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<EMOJI>",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 12
    }
  ]
}

When I flip the order of emoji, the ☀️ will be converted to synonyms, but the apple is not — very odd behavior. Have you ever seen anything like this?

For the record:

I'm using Elasticsearch 7.8
My analyzer includes the icu tokenizer and your synonyms list from this repo, but also a mapping to replace the invalid characters mentioned in #27 with valid characters.

Other than that there are some stemming and stopwords filters, but I've removed all of these and it doesn't seem to make a difference.

damienalexandre commented 4 years ago

Thanks for reporting this issue.

I have some questions:

Did you edit the synonym file to remove ☀? Or do you remove it via a char filter?
The submitted string looks strange (copy-pasted from your _analyze call):

uniscribe "🍏☀<fe0f><fe0f><fe0f>"

  1F34F ├─ 🍏        ├─ GREEN APPLE
   ---- ├┬ ☀️️️     ├┬ Composition
   2600 │├─ ☀       │├─ BLACK SUN WITH RAYS
   FE0F │├─ VS16    │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16    │├─ VARIATION SELECTOR-16
   FE0F │└─ VS16    │└─ VARIATION SELECTOR-16

VARIATION SELECTOR-16 is used to force the EMOJI version of ☀ but it's only needed once.
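
If uniscribe is not at hand, the same inspection can be approximated with Python's standard library. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the sample string is written with escapes to mirror the text from the _analyze call.

```python
import unicodedata

def describe(text):
    """Return (hex code point, Unicode name) pairs for each character in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Green apple, then a sun followed by three variation selectors,
# mirroring the string submitted to _analyze.
sample = "\U0001F34F\u2600\uFE0F\uFE0F\uFE0F"

for code, name in describe(sample):
    print(code, name)
```

Any FE0F entry after the first one following the same base character is redundant.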

harmenjanssen commented 4 years ago

Did you edit the synonym file to remove ☀? Or do you remove it via a char filter?

I remove it via a char filter, type mapping, with mappings like this:

'*=>star',
'✓=>checkmark',

The submitted string looks strange (copy-pasted from your _analyze call):

I agree, it does! I inserted it into Kibana using the standard MacOS emoji picker. Upon insertion it changed into the more "text-like" sun thing you see in my code snippet.

However, the same thing happens with an avocado (which does look like an actual emoji):

GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏🥑️️️"
}

Response

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appel",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "groen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "appel",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

damienalexandre commented 4 years ago

Just made some tests.

The invisible character in your string (FE0F, VS16, VARIATION SELECTOR-16) is understood neither by the Elasticsearch standard analyzer nor by icu_tokenizer.

So we need to strip the emoji variation selectors from the tokens before passing them to the synonym token filter.

This can be done like this:

"emoji_variation_selector_filter": {
    "type": "pattern_replace",
    "pattern": "\\uFE0E|\\uFE0F",
    "replacement": ""
}

Your search is 🍏🥑<fe0f><fe0f><fe0f>; by default it produces two tokens:

🍏
🥑<fe0f>

As 🥑<fe0f> is not in the synonym file, you don't get the annotations.

When we apply the above filter, we get these tokens:

🍏
🥑

And then the synonym filter can work to add the tokens!
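
To illustrate why the lookup fails, here is a toy model in Python. It is not how Elasticsearch implements the filters; the two-entry `SYNONYMS` map and its annotations are hypothetical stand-ins for the synonym file, and the helper names are made up.

```python
import re

# Same pattern as emoji_variation_selector_filter: VS15 (FE0E) or VS16 (FE0F).
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")

# Hypothetical stand-in for the synonym file (annotations are illustrative).
SYNONYMS = {
    "\U0001F34F": ["apple", "fruit", "green"],
    "\U0001F951": ["avocado", "fruit"],
}

def strip_selectors(token):
    """Remove variation selectors from a single token."""
    return VARIATION_SELECTORS.sub("", token)

def expand(token):
    """Return the token plus its synonyms, using an exact-match lookup."""
    return [token] + SYNONYMS.get(token, [])

raw = "\U0001F951\uFE0F"  # avocado followed by a variation selector

print(expand(raw))                   # exact match fails: only the raw token
print(expand(strip_selectors(raw)))  # match succeeds: token plus annotations
```

The lookup is an exact string match, so a trailing selector is enough to make the token miss its entry.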

I added this filter in the README, added tests and I'm now closing this issue. Feel free to comment if there is anything else!

See changes here: https://github.com/jolicode/emoji-search/commit/bea5b31d96ac641ebe6eace8da07ff0ff610bc2c

Also, since last time, the emoji files have been fixed for the "completely eliminated by analyzer" issue 😉

harmenjanssen commented 4 years ago

That's great! Thanks so much for maintaining this repo and debugging this issue. I will implement the filter and download the new dictionary files.

harmenjanssen commented 4 years ago

Oddly enough I still get an error on the line:

〽 => 〽, mark, part, part alternation mark

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1263","caused_by":{"type":"illegal_argument_exception","reason":"term: 〽 was completely eliminated by analyzer"}}},"status":400}

It might be relevant to know that we do not use the file as-is, but add its rules programmatically through the synonyms setting:

'filter' => [
    'english_emoji' => [
        'type' => 'synonym',
        'synonyms' => [],                // Will be filled by reading the synonyms file.
    ],
    // @see https://github.com/jolicode/emoji-search/issues/26
    'emoji_variation_selector_filter' => [
        'type' => 'pattern_replace',
        'pattern' => '\\uFE0E|\\uFE0F',
        'replacement' => ''
    ],
    ...

Do you think that makes a difference? We're using ES 7.8.0 — your tests are succeeding on 7.8.1 so I'm assuming 7.8.0 should work as well...
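
For reference, filling the synonyms list from the file amounts to something like the sketch below. Our real code is PHP; this is a simplified Python rendering with a made-up function name, assuming the file follows the Solr synonym format in which blank lines and lines starting with # are skipped.

```python
def load_synonym_rules(lines):
    """Keep the non-empty, non-comment lines of a Solr-format synonyms file."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append(line)
    return rules

# Hypothetical excerpt; a real run would iterate over the opened file instead,
# e.g. `with open(path, encoding="utf-8") as f: rules = load_synonym_rules(f)`.
sample = [
    "# comment line",
    "",
    "\u303D => \u303D, mark, part, part alternation mark",
]

print(load_synonym_rules(sample))
```

The resulting list is what ends up under the 'synonyms' key of the english_emoji filter.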

harmenjanssen commented 4 years ago

Ah, got it: when using tokenizer icu_tokenizer it fails, but with tokenizer standard it works. I was under the impression I had to use icu_tokenizer.

harmenjanssen commented 4 years ago

So the synonyms file is correct now; it imports correctly when using icu_tokenizer. However, the original problem remains: subsequent emoji are still not translated into synonyms. Look:

Request:

GET /stedelijk_staging_en/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "🍏🥑️️️️"
}

Response:

{
  "tokens" : [
    {
      "token" : """🍏""",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "appl",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "fruit",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "green",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : """🥑""",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "",
      "position" : 1
    }
  ]
}

My english_with_emoji analyzer is setup like this:

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'english_emoji',
        'emoji_variation_selector_filter',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],

Any ideas?

damienalexandre commented 4 years ago

Hi! Long time no see 👋

I took the time to test with icu_tokenizer and there were some emoji to remove. I just opened #33 for that, thanks! (It's now fully tested with both the standard and icu tokenizers.)

About your other issue, here is the full string you searched for:

  1F34F ├─ 🍏           ├─ GREEN APPLE
   ---- ├┬ 🥑️️️️               ├┬ Composition
  1F951 │├─ 🥑          │├─ AVOCADO
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │├─ VS16        │├─ VARIATION SELECTOR-16
   FE0F │└─ VS16        │└─ VARIATION SELECTOR-16

As you can see, the string contains a lot of VARIATION SELECTOR characters.

You added the emoji_variation_selector_filter to handle that, but you put it after the english_emoji token filter; it must come before it, because token filters run in the order they are listed.

'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'emoji_variation_selector_filter',
        'english_emoji',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],
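
Why the order matters can be shown with a toy filter chain in Python. This is only a sketch of the idea, not Elasticsearch's implementation: the one-entry `SYNONYMS` map and all function names are hypothetical.

```python
import re

SYNONYMS = {"\U0001F951": ["avocado"]}  # hypothetical one-entry synonym map

def strip_selectors(tokens):
    """Toy version of emoji_variation_selector_filter."""
    return [re.sub("[\uFE0E\uFE0F]", "", t) for t in tokens]

def add_synonyms(tokens):
    """Toy version of english_emoji: exact-match lookup per token."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

def run(tokens, filters):
    """Apply each filter in the order it is listed."""
    for f in filters:
        tokens = f(tokens)
    return tokens

tokens = ["\U0001F951\uFE0F"]  # avocado with a trailing variation selector

# Synonyms first: the lookup sees the selector and finds nothing.
print(run(tokens, [add_synonyms, strip_selectors]))
# Selector filter first: the lookup sees the bare emoji and matches.
print(run(tokens, [strip_selectors, add_synonyms]))
```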

Have a great day ;-)

harmenjanssen commented 4 years ago

Yes! That works fantastically! Thanks, that was easier than I expected.

I'm going to recreate my mappings and re-index. Thanks again!

jolicode / emoji-search

Plugin release for ElasticSearch 7.8 #26