o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch
http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/
Apache License 2.0

Dense Vector Feature as a param to a Mustache script score template #465

Open razilevin opened 1 year ago

razilevin commented 1 year ago

I am trying to use embeddings to compute cosine similarity. The problem is that I see no way to pass the embedding as a param when invoking the following feature during logging.

{
    "name": "vector_simularity",
    "params": [
        "embedding"
    ],
    "template_language": "mustache",
    "template": {
        "function_score": {
            "script_score": {
                "script": {
                    "source": "1 + cosineSimilarity(params.query_vector, doc['base_name_vector'])",
                    "params": {
                        "query_vector": "{{#toJson}}embedding{{/toJson}}"
                    }
                }
            }
        }
    }
}

I got the idea to use the toJson mustache section from another issue, which seems to match what I am trying to do: https://github.com/o19s/elasticsearch-learning-to-rank/issues/338

I get the following error when running the query:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "1 + cosineSimilarity(params['query_vector'], doc['base_name_vector'])",
          "                           ^---- HERE"
        ],
        "script": "1 + cosineSimilarity(params['query_vector'], doc['base_name_vector'])",
        "lang": "painless",
        "position": {
          "offset": 27,
          "start": 0,
          "end": 69
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "semantic_search",
        "node": "57-FXL1dQwOjKxaOn62-Dw",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "1 + cosineSimilarity(params['query_vector'], doc['base_name_vector'])",
            "                           ^---- HERE"
          ],
          "script": "1 + cosineSimilarity(params['query_vector'], doc['base_name_vector'])",
          "lang": "painless",
          "position": {
            "offset": 27,
            "start": 0,
            "end": 69
          },
          "caused_by": {
            "type": "class_cast_exception",
            "reason": "class java.lang.String cannot be cast to class java.util.List (java.lang.String and java.util.List are in module java.base of loader 'bootstrap')"
          }
        }
      }
    ]
  },
  "status": 400
}

Please note that a query like the following, where query_embedding stands for the array of floats, works as expected:

{
  "size": 36,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['base_name_vector']) + 1.0",
        "params": {
          "queryVector": query_embedding
        }
      }
    }
  }
}
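
The class_cast_exception in the error above (String cannot be cast to List) is consistent with the rendered parameter arriving as a quoted string instead of an array. The following is a minimal Python sketch of that effect, not the plugin's actual rendering code: the `render` helper is a hypothetical stand-in that only substitutes the placeholder with the JSON encoding of the parameter, which is what the toJson section is assumed to do.

```python
import json

# Hypothetical stand-in for Mustache's {{#toJson}}...{{/toJson}} section:
# it swaps the placeholder for the JSON encoding of the parameter.
PLACEHOLDER = "{{#toJson}}embedding{{/toJson}}"


def render(template: str, embedding: list) -> dict:
    """Substitute the placeholder, then parse the result as JSON."""
    rendered = template.replace(PLACEHOLDER, json.dumps(embedding))
    return json.loads(rendered)


embedding = [0.1, 0.2, 0.3]

# Nested-object form: the placeholder is a JSON string value, so it is
# serialized with surrounding quotes before any rendering happens.
nested_template = json.dumps({"params": {"query_vector": PLACEHOLDER}})
quoted = render(nested_template, embedding)["params"]["query_vector"]

# String form: the placeholder is written without surrounding quotes.
string_template = '{"params": {"query_vector": ' + PLACEHOLDER + "}}"
unquoted = render(string_template, embedding)["params"]["query_vector"]

print(type(quoted).__name__, type(unquoted).__name__)  # prints "str list"
```

In the quoted case the script would receive a string such as "[0.1, 0.2, 0.3]", which cosineSimilarity cannot treat as a list.
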

razilevin commented 1 year ago

I hacked it like this to make it work. Here is the definition of the feature. Any feedback?

{
    "name": "vector_simularity",
    "params": [
        "embedding"
    ],
    "template_language": "mustache",
    "template": {
        "function_score": {
            "script_score": {
                "script": {
                    "source": """
                    List parseArrayOfFloats(def aryOfFloats) {
                        def x = aryOfFloats.substring(1, aryOfFloats.length() - 1);
                        def z = new StringTokenizer(x, ",");
                        def y = new ArrayList();

                        while(z.hasMoreTokens()) {
                            y.add(Float.parseFloat((String)z.nextToken()));
                        }

                        return y;
                    }

                    return cosineSimilarity(parseArrayOfFloats(params.query_vector), 'base_name_vector') + 1.0;
                    """,
                    "params": {
                        "query_vector": "{{#toJson}}embedding{{/toJson}}"
                    }
                }
            }
        }
    }
}

jhinch-at-atlassian-com commented 12 months ago

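I believe the problem is that the original feature definition is not structured correctly. The template can be either a deeply nested query object or a string. In the nested form, the toJson section sits inside a quoted JSON string, so the rendered array reaches the script as a String rather than a List, which matches the class_cast_exception above. For toJson to work correctly, the template needs to be a string: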

{
    "name": "vector_simularity",
    "params": [
        "embedding"
    ],
    "template_language": "mustache",
    "template": "{\"function_score\": {\"script_score\": {\"script\": {\"source\": \"1 + cosineSimilarity(params.query_vector, doc['base_name_vector'])\", \"params\": {\"query_vector\": {{#toJson}}embedding{{/toJson}}}}}}}"
}

Note that every " inside the template string is escaped and that {{#toJson}}embedding{{/toJson}} is not enclosed in quotes.
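
Escaping the template by hand is easy to get wrong, so it may help to generate the string instead. The following is a minimal Python sketch under a few assumptions: `SENTINEL` is a hypothetical marker introduced only for this example, and the feature body reuses the field names from the definition above. It builds the query as a normal dict, serializes it, and then swaps the quoted sentinel for the unquoted toJson section.

```python
import json

TO_JSON = "{{#toJson}}embedding{{/toJson}}"
SENTINEL = "__EMBEDDING__"  # hypothetical marker, replaced after serialization

query = {
    "function_score": {
        "script_score": {
            "script": {
                "source": "1 + cosineSimilarity(params.query_vector, doc['base_name_vector'])",
                "params": {"query_vector": SENTINEL},
            }
        }
    }
}

# json.dumps emits the sentinel with surrounding quotes; replacing the
# quoted form leaves the Mustache section unquoted in the template string.
template = json.dumps(query).replace(json.dumps(SENTINEL), TO_JSON)

feature = {
    "name": "vector_simularity",
    "params": ["embedding"],
    "template_language": "mustache",
    "template": template,
}

# Serializing the feature escapes the inner quotes of the template string.
print(json.dumps(feature, indent=4))
```

Because the braces and quotes come from the serializer, the template string stays balanced even if the query is later extended.
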