Closed harmenjanssen closed 5 years ago
So we can't really upgrade to version 35. If a synonym contains a token that the "icu_tokenizer" can't process, it can produce an error (that was the case the last time I updated).
Maybe you could test with a new emoji?
🧏 => 🧏, accessibility, deaf, deaf person, ear, hear
🦦 => 🦦, fishing, otter, playful
🧇 => 🧇, indecisive, iron, waffle
What happens if you try those with the _analyze API + your synonym files?
Thanks!
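For anyone wanting to script this check, here is a minimal sketch in Python. It assumes the synonym rules use the Solr-style "lhs => a, b, c" format shown above, and that the index and analyzer are named my_index and english_with_emoji as in the reply in this thread. The helpers parse_rule and analyze_body are hypothetical names for illustration, not part of this repo. The script only prints the request bodies; it does not contact a cluster.

```python
import json


def parse_rule(line):
    """Split a Solr-style 'lhs => a, b, c' rule into (lhs, [expansions])."""
    lhs, rhs = line.split("=>", 1)
    return lhs.strip(), [term.strip() for term in rhs.split(",") if term.strip()]


def analyze_body(text, analyzer="english_with_emoji"):
    """Build the JSON body for an _analyze request (analyzer name is an assumption)."""
    return {"text": text, "analyzer": analyzer}


rules = [
    "🧏 => 🧏, accessibility, deaf, deaf person, ear, hear",
    "🧇 => 🧇, indecisive, iron, waffle",
]

for rule in rules:
    emoji, expansions = parse_rule(rule)
    # Print a console-style request that can be pasted into Kibana Dev Tools.
    print("GET my_index/_analyze")
    print(json.dumps(analyze_body(emoji), ensure_ascii=False, indent=2))
```

Each printed request can be run by hand, and the returned tokens compared with the expansions that parse_rule extracted from the same line.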
I'm not seeing those emoji myself on my OS, but by copying and pasting I get the following results:
Otter
GET my_index/_analyze
{
"text": "🦦 ",
"analyzer": "english_with_emoji"
}
---
{
"tokens": [
{
"token": "🦦",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "fishing",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "otter",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "playful",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
}
]
}
Accessibility
GET my_index/_analyze
{
"text": "🧏 ",
"analyzer": "english_with_emoji"
}
---
{
"tokens": [
{
"token": "🧏",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "accessibility",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "deaf",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "deaf",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "ear",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "hear",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "person",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 1
}
]
}
Indecisive
GET my_index/_analyze
{
"text": "🧇 ",
"analyzer": "english_with_emoji"
}
---
{
"tokens": [
{
"token": "🧇",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "indecisive",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "iron",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "waffle",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
}
]
}
I think this looks good, right?
Let me know if I can do some further analysis.
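The comparison of these responses against the synonym rules can also be sketched in Python. This is an illustration under stated assumptions: the helper names are hypothetical, the response is trimmed to the token field, and tokens are compared by exact match, which only holds if no stemming runs after the synonym filter (consistent with the unstemmed tokens in the output above).

```python
def expected_tokens(expansions):
    """Split multi-word expansions on whitespace: 'deaf person' gives 'deaf' and 'person'."""
    words = set()
    for term in expansions:
        words.update(term.split())
    return words


def missing_tokens(expansions, response):
    """Return expected tokens that do not appear in an _analyze response."""
    seen = {tok["token"] for tok in response["tokens"]}
    return expected_tokens(expansions) - seen


def utf16_length(text):
    """Length in UTF-16 code units, the unit Lucene uses for token offsets."""
    return len(text.encode("utf-16-le")) // 2


expansions = ["🧏", "accessibility", "deaf", "deaf person", "ear", "hear"]
# Tokens copied from the 🧏 response above, other fields omitted.
response = {"tokens": [{"token": t} for t in
                       ["🧏", "accessibility", "deaf", "deaf", "ear", "hear", "person"]]}

print(missing_tokens(expansions, response))  # set()
print(utf16_length("🧏"))  # 2
```

This also accounts for two details in the output: "deaf" appears twice because both "deaf" and "deaf person" expand to it, with "person" at position 1, and end_offset is 2 for a single emoji because these code points lie outside the Basic Multilingual Plane and take two UTF-16 code units.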
Looks good to me, thank you for your contribution!
I upgraded the synonym lists per the instructions: updated the version in the PHP script and ran php build-released.php.
Thanks for maintaining this repo!