jolicode / emoji-search

:smile: Emoji synonyms to build your own emoji-capable search engine (elasticsearch, solr, OpenSearch)
https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji
MIT License

Investigate the plugin usefulness in the light of how ICUTokenizer works #21

Closed: damienalexandre closed this issue 2 years ago

damienalexandre commented 6 years ago

Looks like we could just provide a new Rule File instead of tricking the ICU Tokenizer.

As this code shows:

https://github.com/apache/lucene-solr/blob/23bff7dbc207083af2ccb1b308c121ac18c36508/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.java#L116-L125

The default config is used when there is no rule file for the current "script", which addresses a concern I had about changing the RBBI rules.
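To make the fallback concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not the Lucene code: it assumes the rule-file option is a comma-separated list of `script:file` pairs and that any script without an entry gets the default rules. The names `parse_rule_files`, `rules_for_script`, `icu_tokenizer_settings`, `custom_icu` and `custom_latin.rbbi` are hypothetical. The settings shape follows the `rule_files` option of the Elasticsearch `icu_tokenizer` as I understand it, so check it against the docs for your version.

```python
import json

# Placeholder name for the built-in rules (assumption, for illustration only).
DEFAULT_RULES = "Default.rbbi"


def parse_rule_files(spec):
    """Parse a 'Script:file,Script:file' string into a {script: file} dict."""
    mapping = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        script, sep, path = entry.partition(":")
        script, path = script.strip(), path.strip()
        if not sep or not script or not path:
            raise ValueError("expected 'script:file', got %r" % entry)
        mapping[script] = path
    return mapping


def rules_for_script(mapping, script):
    """Return the custom rule file for a script, or the default when none is set."""
    return mapping.get(script, DEFAULT_RULES)


def icu_tokenizer_settings(spec, name="custom_icu"):
    """Build index settings declaring an icu_tokenizer with custom rule files."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        name: {"type": "icu_tokenizer", "rule_files": spec}
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    spec = "Latn:custom_latin.rbbi"  # hypothetical rule file name
    mapping = parse_rule_files(spec)
    print(rules_for_script(mapping, "Latn"))  # custom file is used
    print(rules_for_script(mapping, "Cyrl"))  # no entry, so the default applies
    print(json.dumps(icu_tokenizer_settings(spec), indent=2))
```

Only the scripts listed in the spec are overridden, so shipping a single rule file should leave tokenization of every other script untouched. Which script code emoji are reported under is not covered here and would need to be checked.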

What the plugin could do then:

damienalexandre commented 6 years ago

I tried importing https://github.com/apache/lucene-solr/blob/4522e45bdadd4268a9270135130fc28a7f46c627/lucene/analysis/icu/src/data/uax29/Default.rbbi as a custom RBBI config. It seems to load fine, but the following error suggests there may be some bad word breaking.

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 😀 was completely eliminated by analyzer"
      }
    }
  },
  "status": 400
}

damienalexandre commented 2 years ago

The plugin is not needed anymore.