Upgrade to version 35

harmenjanssen commented 5 years ago

I upgraded the synonym lists per the instructions: I updated the version in the PHP script and ran php build-released.php.

Thanks for maintaining this repo!

damienalexandre commented 5 years ago

The ICU version shipped with Elastic 7.2 is 62.1: https://github.com/elastic/elasticsearch/blob/v7.2.0/buildSrc/version.properties#L14
The CLDR version shipped in ICU 62.1 is 33.1: http://site.icu-project.org/download/62

So we can't really upgrade to version 35. If a synonym contains a token that the "icu_tokenizer" can't process, it could produce an error (that was the case the last time I updated).

Maybe you could test with a new emoji?

🧏 => 🧏, accessibility, deaf, deaf person, ear, hear
🦦 => 🦦, otter, speel, visvang
🧇 => 🧇, indecisive, iron, waffle

What happens if you try those with the _analyze API + your synonym files?

Thanks!
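For anyone who wants to script this check, here is a minimal sketch of splitting the rules above into the emoji and its synonyms, and building an _analyze request body for each. The helper names are hypothetical, and the analyzer name "english_with_emoji" is only an assumption taken from the requests posted in this thread; substitute whatever your index defines.

```python
import json


def parse_synonym_rule(line):
    """Split an explicit mapping 'lhs => a, b, c' into (lhs, [a, b, c])."""
    lhs, rhs = line.split("=>", 1)
    synonyms = [term.strip() for term in rhs.split(",") if term.strip()]
    return lhs.strip(), synonyms


def analyze_request(text, analyzer="english_with_emoji"):
    """Build the JSON body for GET <index>/_analyze (analyzer name is an assumption)."""
    return {"text": text, "analyzer": analyzer}


rules = [
    "🧏 => 🧏, accessibility, deaf, deaf person, ear, hear",
    "🦦 => 🦦, otter, speel, visvang",
    "🧇 => 🧇, indecisive, iron, waffle",
]

for rule in rules:
    emoji, synonyms = parse_synonym_rule(rule)
    body = json.dumps(analyze_request(emoji), ensure_ascii=False)
    print(body, "should yield", synonyms)
```

The printed bodies can be pasted into the console or sent with any HTTP client; the sketch does not talk to a cluster itself.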

harmenjanssen commented 5 years ago

I'm not seeing those emoji myself on my OS, but by copying and pasting I get the following results:

Otter

GET my_index/_analyze
{
  "text": "🦦 ",
  "analyzer": "english_with_emoji"
}

---

{
  "tokens": [
    {
      "token": "🦦",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "fishing",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "otter",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "playful",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
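A side note on the offsets: an end_offset of 2 for a single emoji is what you would expect if offsets are counted in UTF-16 code units, which is my understanding of how Lucene-based analyzers report them. 🦦 is U+1F9A6, outside the Basic Multilingual Plane, so it takes two code units. A small sketch (the helper name is hypothetical):

```python
def utf16_units(text):
    """Count UTF-16 code units, the unit the offsets appear to use."""
    return len(text.encode("utf-16-le")) // 2


otter = "\U0001F9A6"  # 🦦

print(len(otter))          # 1 code point
print(utf16_units(otter))  # 2 UTF-16 code units
```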

Accessibility

GET my_index/_analyze
{
  "text": "🧏 ",
  "analyzer": "english_with_emoji"
}

---

{
  "tokens": [
    {
      "token": "🧏",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "accessibility",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "deaf",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "deaf",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ear",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "hear",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "person",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 1
    }
  ]
}
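The repeated "deaf" and the "person" token at position 1 are consistent with the multi-word synonym "deaf person" being split into two tokens. As a rough sanity check, here is a sketch that models this by splitting each synonym on whitespace and assigning word i to position i. This is a simplified model for comparison only, not how the synonym filter is implemented, and the function name is hypothetical.

```python
from collections import Counter


def expected_tokens(synonyms):
    """Simplified model: split each synonym on whitespace; word i gets position i."""
    pairs = []
    for synonym in synonyms:
        for position, word in enumerate(synonym.split()):
            pairs.append((word, position))
    return Counter(pairs)


synonyms = ["🧏", "accessibility", "deaf", "deaf person", "ear", "hear"]

# (token, position) pairs copied from the response above.
observed = Counter([
    ("🧏", 0), ("accessibility", 0), ("deaf", 0), ("deaf", 0),
    ("ear", 0), ("hear", 0), ("person", 1),
])

print(expected_tokens(synonyms) == observed)  # True
```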

Indecisive

GET my_index/_analyze
{
  "text": "🧇 ",
  "analyzer": "english_with_emoji"
}

---

{
  "tokens": [
    {
      "token": "🧇",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "indecisive",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "iron",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "waffle",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

I think this looks good, right?

Let me know if I can do some further analysis.

damienalexandre commented 5 years ago

Looks good to me, thank you for your contribution!

jolicode / emoji-search

Upgrade to version 35 #23