Hi!
There is no need for the plugin with Elasticsearch version >= 6.4, as the ICU library has been updated.
So with your 7.8 you just have to install the "analysis-icu" plugin (because you need the icu_tokenizer) and use the dictionary as a synonym token filter.
Something like this:
PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}
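To verify the setup, you can run a quick _analyze call against the new index (a sketch; the sample text is mine, and the exact tokens returned depend on the dictionary content):

GET /emoji-capable/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "I love 🍏"
}

If the synonym file is picked up, the 🍏 token should come back together with its English annotations as extra SYNONYM tokens.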
I suggest this blog post for more information: https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji
Hmm, then maybe my question is wrong, haha.
I got the following error when creating the index:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","caused_by":{"type":"parse_exception","reason":"Invalid synony
m rule at line 1","caused_by":{"type":"illegal_argument_exception","reason":"term: \uD83C\uDFFB was completely eliminated by analyzer"}}},"status":400}
and assumed from that other thread I would need your plugin to fix this. But am I right in concluding the dictionary file can be used when I configure the ICU tokenizer?
Yes, I suspect you didn't use ICU at all when you got this error?
That's true.
Thanks for getting back to me so quickly, I'm sure I can make it work. 🙂
We did make it work, eventually!
However, our client reported a strange bug in which the query "🍏☀️" would yield results, but "☀️🍏" would not.
Upon inspection, the first emoji is converted into synonyms, but the second one isn't:
GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏☀️️️"
}
Response:
{
"tokens" : [
{
"token" : """🍏""",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "appel",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "fruit",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "groen",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "☀",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<EMOJI>",
"position" : 1
},
{
"token" : "appel",
"start_offset" : 2,
"end_offset" : 6,
"type" : "SYNONYM",
"position" : 12
}
]
}
When I flip the order of emoji, the ☀️ will be converted to synonyms, but the apple is not — very odd behavior. Have you ever seen anything like this?
For the record:
Other than that there are some stemming and stopwords filters, but I've removed all of these and it doesn't seem to make a difference.
Thanks for reporting this issue.
I have some questions:
- did you edit the synonym file to remove ☀? Or do you remove this via a char filter?
- the submitted string looks strange (copy pasted from your _analyze call):

uniscribe "🍏☀<fe0f><fe0f><fe0f>"
1F34F ├─ 🍏 ├─ GREEN APPLE
---- ├┬ ☀️️️ ├┬ Composition
2600 │├─ ☀ │├─ BLACK SUN WITH RAYS
FE0F │├─ VS16 │├─ VARIATION SELECTOR-16
FE0F │├─ VS16 │├─ VARIATION SELECTOR-16
FE0F │└─ VS16 │└─ VARIATION SELECTOR-16
VARIATION SELECTOR-16 is used to force the EMOJI version of ☀, but it's only needed once.
did you edit the synonym file to remove ☀ ? Or do you remove this via a char filter ?
I remove it via a char filter of type mapping, with mappings like this:
'*=>star',
'✓=>checkmark',
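For context, such a mapping char filter could be declared in the index settings roughly like this (a sketch; the filter name is a placeholder and only the two rules mentioned above are shown):

"char_filter": {
  "special_char_mapping": {
    "type": "mapping",
    "mappings": [
      "*=>star",
      "✓=>checkmark"
    ]
  }
}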
the submitted string looks strange (copy pasted from your _analyze call):
I agree, it does! I inserted it into Kibana using the standard macOS emoji picker. Upon insertion it changed into the more "text-like" sun thing you see in my code snippet.
However, the same thing happens with an avocado (which does look like an actual emoji):
GET /stedelijk_nl/_analyze
{
  "analyzer": "dutch_with_emoji",
  "text": "🍏🥑️️️"
}
{ "tokens" : [ { "token" : """🍏""", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "appel", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "fruit", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "groen", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "groen", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : """🥑""", "start_offset" : 2, "end_offset" : 7, "type" : "", "position" : 1 }, { "token" : "appel", "start_offset" : 2, "end_offset" : 7, "type" : "SYNONYM", "position" : 1 } ] }
Just made some tests.
The invisible char you have in the string (FE0F, VARIATION SELECTOR-16) is not understood by the Elasticsearch standard analyzer, nor by the icu_tokenizer.
So we need to strip those emoji variation selectors before handing the tokens to the synonym token filter.
This can be done like this:
"emoji_variation_selector_filter": {
"type": "pattern_replace",
"pattern": "\\uFE0E|\\uFE0F",
"replace": ""
}
Your search is 🍏🥑<fe0f><fe0f><fe0f>, which produces two tokens by default:
🍏
🥑<fe0f>
As 🥑<fe0f> is not in the synonym file, you don't get the annotations.
When we apply the above filter we get these tokens:
🍏
🥑
And then the synonym filter can do its job and add the annotation tokens!
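You can see the effect without touching the index by passing an ad-hoc filter chain to _analyze (a sketch; it assumes the analysis-icu plugin is installed for the icu_tokenizer):

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\uFE0E|\\uFE0F",
      "replacement": ""
    }
  ],
  "text": "🍏🥑️️️"
}

This should return the two tokens 🍏 and 🥑 with the variation selectors stripped, which is exactly what the synonym filter needs as input.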
I added this filter in the README, added tests and I'm now closing this issue. Feel free to comment if there is anything else!
See changes here: https://github.com/jolicode/emoji-search/commit/bea5b31d96ac641ebe6eace8da07ff0ff610bc2c
Also since last time the emoji files have been fixed for the "completely eliminated by analyzer" issue :wink:
That's great! Thanks so much for maintaining this repo and debugging this issue. I will implement the filter and download the new dictionary files.
Oddly enough I still get an error on the line:
〽 => 〽, mark, part, part alternation mark
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"failed to build synonyms"}],"type":"illegal_argument_exception","reason":"failed to build synonyms","c
aused_by":{"type":"parse_exception","reason":"Invalid synonym rule at line 1263","caused_by":{"type":"illegal_argument_exception","reason":"term: 〽 was completely eliminate
d by analyzer"}}},"status":400}
It might be relevant to know we do not use the file as-is but add the rules programmatically through synonyms:
'filter' => [
    'english_emoji' => [
        'type' => 'synonym',
        'synonyms' => [], // Will be filled by reading the synonyms file.
    ],
    // @see https://github.com/jolicode/emoji-search/issues/26
    'emoji_variation_selector_filter' => [
        'type' => 'pattern_replace',
        'pattern' => '\\uFE0E|\\uFE0F',
        'replacement' => ''
    ],
    ...
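For reference, the resulting inline-synonyms definition that ends up being sent to Elasticsearch looks roughly like this (a sketch; only the 〽 rule quoted above is shown):

"english_emoji": {
  "type": "synonym",
  "synonyms": [
    "〽 => 〽, mark, part, part alternation mark"
  ]
}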
Do you think that makes a difference? We're using ES 7.8.0 — your tests are succeeding on 7.8.1 so I'm assuming 7.8.0 should work as well...
Ah, got it: when using the icu_tokenizer it fails, but with the standard tokenizer it works.
I was under the impression I had to use the icu_tokenizer.
So the synonyms file is correct right now; it imports correctly when using the standard tokenizer.
However, the original problem of translating subsequent emoji into synonyms still does not work, look:
Request:
GET /stedelijk_staging_en/_analyze
{
  "analyzer": "english_with_emoji",
  "text": "🍏🥑️️️️"
}
My english_with_emoji analyzer is set up like this:
'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'english_emoji',
        'emoji_variation_selector_filter',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],
Any ideas?
Hi! Long time no see :wave:
I took the time to test with the icu_tokenizer and there were some emoji to remove, just opened #33 for that, thanks! (It's now fully tested with both the standard and icu tokenizers.)
About your other issue, here is the full string you searched for:
1F34F ├─ 🍏 ├─ GREEN APPLE
---- ├┬ 🥑️️️️ ├┬ Composition
1F951 │├─ 🥑 │├─ AVOCADO
FE0F │├─ VS16 │├─ VARIATION SELECTOR-16
FE0F │├─ VS16 │├─ VARIATION SELECTOR-16
FE0F │├─ VS16 │├─ VARIATION SELECTOR-16
FE0F │└─ VS16 │└─ VARIATION SELECTOR-16
As you can see, we have a lot of VARIATION SELECTOR characters.
To handle them you added the emoji_variation_selector_filter, but you put it after the english_emoji token filter; it must come before it.
'english_with_emoji' => [
    'char_filter' => [
        'html_strip'
    ],
    'tokenizer' => 'standard',
    'filter' => [
        'emoji_variation_selector_filter',
        'english_emoji',
        'lowercase',
        'english_stop',
        'english_stemmer',
    ],
],
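Spelled out as a full Elasticsearch settings request, the corrected chain would look roughly like this (a sketch; the index name reuses the earlier example, the english_stop and english_stemmer definitions are assumed standard ones, and synonyms_path assumes the dictionary sits in the node's config/analysis directory):

PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": [
            "emoji_variation_selector_filter",
            "english_emoji",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}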
Have a great day ;-)
Yes! That works fantastically! Thanks, that was easier than I expected.
I'm going to recreate my mappings and re-index. Thanks again!
Hi @damienalexandre,
We're having the same issue you describe here, when using the dictionary file from this repository.
We've just upgraded from Elasticsearch 5.3 to 7.8, however, so we can't use your plugin to solve this issue (yet). Is a 7.8 release on your roadmap by any chance?
Thanks in advance!