o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch
http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/
Apache License 2.0

Unable to parameterize non-string values (e.g. embedding vectors) when building mustache feature template #338

Open DIXLTICMU opened 3 years ago

DIXLTICMU commented 3 years ago

As we know, beginning with ES 7.x Elasticsearch introduced better support for embedding vectors, and it also provides utility functions in Painless for vector arithmetic such as cosineSimilarity... https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-script-score-query.html#vector-functions

This is huge for us... this support is immensely powerful for LTR, as we can now rely on embedding distances between a seed embedding derived from the search query and the pre-indexed embeddings of the documents. For example, with this field in my index:

{
  "words" : {
    "type" : "nested",
    "properties" : {
      "raw" : {
        "type" : "keyword"
      },
      "text" : {
        "type" : "text"
      },
      "embedding" : {
        "type" : "dense_vector",
        "dims" : 64
      }
    }
  }
}

I can do an ES query such as

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "function_score": {
                "query": {
                  "exists": {
                    "field": "words.embedding",
                    "boost": 1
                  }
                },
                "functions": [
                  {
                    "filter": {
                      "exists": {
                        "field": "words.embedding",
                        "boost": 1
                      }
                    },
                    "script_score": {
                      "script": {
                        "source": "0.5 * cosineSimilarity(params['query_vector'], doc['words.embedding']) + 0.5",
                        "lang": "painless",
                        "params": {
                          "query_vector": [
                            -0.79067802429199,
                            0.074841000139713,
                            1.2398209571838,
                            -0.2216549962759,
                            0.014894000254571,
                            ......
                          ]
                        }
                      }
                    }
                  }
                ],
                "boost": 1
              }
            },
            "path": "words",
            "ignore_unmapped": false,
            "score_mode": "avg",
            "boost": 1
          }
        }
      ]
    }
  }
}

And ES executes it extremely efficiently, to the extent that we can definitely make it an LTR feature... (In my case, calculating the cosine between a 64-D seed embedding and 10 pre-indexed embeddings for each of 10M documents took only 300ms with 30 machines and 30 shards.) In reality the workload is much smaller...
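For intuition, the script's scoring formula can be sketched in plain Python (a rough analogue of the Painless one-liner, not the actual implementation): cosineSimilarity returns a value in [-1, 1], and the 0.5 * x + 0.5 transform maps it into [0, 1] so it behaves well as a function score.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector, doc_vector):
    """Mirror of the Painless source: 0.5 * cosineSimilarity(...) + 0.5."""
    return 0.5 * cosine_similarity(query_vector, doc_vector) + 0.5

# Identical vectors score 1.0; opposite vectors score 0.0.
print(script_score([1.0, 0.0], [1.0, 0.0]))   # → 1.0
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # → 0.0
```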

But unfortunately I found a very serious issue around the way we do I/O for mustache-based features... The only way I can implement my feature based on the query above is via the mustache feature template, and the issue is that every parameter handled by the mustache template engine gets transformed into a string, or a list of strings, since Mustache has only rudimentary support for list types (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-template.html#_passing_an_array_of_strings)... Here is the feature I am attempting to build:

{
  "validation": {
    "index": "my_index",
    "params": {
      "query_embedding": [
        -0.79067802429199,
        0.074841000139713,
        1.2398209571838,
        -0.2216549962759,
        ......
      ]
    }
  },
  "featureset": {
    "features": [
      {
        "template_language": "mustache",
        "name": "query_to_document_words_embedding_distance_feature",
        "params": [
          "query_embedding"
        ],
        "template": {
          "nested": {
            "path": "words",
            "query": {
              "function_score": {
                "query": {
                  "exists": {
                    "field": "words.embedding",
                    "boost": 1
                  }
                },
                "boost": 1,
                "functions": [
                  {
                    "filter": {
                      "exists": {
                        "field": "words.embedding",
                        "boost": 1
                      }
                    },
                    "script_score": {
                      "script": {
                        "lang": "painless",
                        "params": {
                          "query_embedding": [
                            "{{#query_embedding}}",
                            "{{.}}",
                            "{{/query_embedding}}"
                          ]
                        },
                        "source": "Debug.explain(params.query_embedding); 0.5 * cosineSimilarity(params['query_embedding'], doc['words.embedding']) + 0.5"
                      }
                    }
                  }
                ]
              }
            },
            "ignore_unmapped": false,
            "score_mode": "avg",
            "boost": 1
          }
        }
      }
    ]
  }
}

The Debugger outputs something like this:

"painless_class": "java.util.ArrayList",
"to_string": "[, -0.79067802429199, , 0.074841000139713, , 1.2398209571838, , -0.2216549962759, , 0.014894000254571, , 1.3756049871445, , ........, ]",
"java_class": "java.util.ArrayList",

As you may notice, the templating didn't even do a good job of converting the embedding numbers into a string list: it injected empty strings alternating with the non-empty ones.
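For the curious, here is a rough Python simulation of where the alternating empty strings come from (this hand-rolls the Mustache section expansion rather than using a real Mustache engine): the template is rendered as one string, so the {{#query_embedding}} section opens inside the first array element, and the repeated section body spans the quote/comma boundaries between elements, emitting an empty string before each rendered value plus one trailing empty string.

```python
import json

values = [-0.79067802429199, 0.074841000139713]

# The feature template is serialized to one string before Mustache runs, so the
# array becomes: ["{{#query_embedding}}", "{{.}}", "{{/query_embedding}}"]
# The section body between {{#...}} and {{/...}} is '", "{{.}}", "' -- it spans
# the quote/comma boundaries -- and is repeated once per value.
section_body = '", "{{.}}", "'
rendered = '["' + "".join(section_body.replace("{{.}}", str(v)) for v in values) + '"]'

result = json.loads(rendered)
print(result)  # empty strings alternate with the rendered numbers
```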

After hours of research I realized we cannot handle such cases with the current implementation... that is, we only allow users to define mustache features as actual JSON objects, and we can never get away from Mustache converting parameter values that must stay non-string into strings.

You may suggest that I cast the strings to doubles in Painless, and I did exactly that... Painfully, the performance drops from 300ms to 23 seconds, roughly a 75x slowdown...
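To make the cost concrete, the cast-in-script workaround amounts to the following (a Python analogue, not the actual Painless): the parse step is lossless, but it has to run for every vector element on every scored document.

```python
# Native doubles: what the script receives when params arrive as real JSON numbers.
native = [-0.79067802429199, 0.074841000139713]

# Mustache-coerced strings: every scoring call must parse each element back
# into a double -- repeated per nested doc, per hit, hence the slowdown.
coerced = [str(v) for v in native]
parsed = [float(s) for s in coerced]

assert parsed == native  # lossless, but paid for on every invocation
```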

But I think there is a way forward... fortunately, in ES 7.x Mustache supports converting parameters to JSON... so if I change my script to this:

                      "script": {
                        "lang": "painless",
                        "params": {
                          "query_vector": "{{#toJson}}query_embedding{{/toJson}}"
                        },
                        "source": "Debug.explain(params.query_vector);"
                      }

Mustache will transform the parameter into a stringified json array:

"painless_class": "java.lang.String",
"to_string": "[-0.79067802429199,0.074841000139713,1.2398209571838, ...... ,0.093773998320103]",
"java_class": "java.lang.String",

My hunch is that we also need to support defining feature templates as "stringified mustache", such that during feature bootstrap we run the mustache engine on the stringified template before parsing the whole thing as JSON.

This would let us bypass the constraint that the mustache feature template must itself be a valid JSON object, which forces us to add quotation marks in places where there should be none. Put simply, without those quotations the JSON vector [-0.79067802429199,0.074841000139713,1.2398209571838, ...... ,0.093773998320103] can be restored as-is, and the whole rendered string is then parsed into a JSON object and sent to ES for score generation; otherwise it will simply be recognized as a string...
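A minimal Python sketch of the proposed flow (the str.replace stands in for the real Mustache engine, and the template string is hypothetical): because the template is rendered as a raw string first, the {{#toJson}} placeholder can sit unquoted, and the subsequent parse yields a real numeric array.

```python
import json

# Hypothetical stringified template: the {{#toJson}} placeholder is NOT wrapped
# in quotes, which is impossible when the template itself must be valid JSON.
template = '{"params": {"query_vector": {{#toJson}}query_embedding{{/toJson}}}}'

params = {"query_embedding": [-0.79067802429199, 0.074841000139713]}

# Stand-in for Mustache's toJson rendering step.
rendered = template.replace(
    "{{#toJson}}query_embedding{{/toJson}}",
    json.dumps(params["query_embedding"]),
)

# Parse AFTER rendering: the vector comes back as an array of numbers, not a string.
query = json.loads(rendered)
print(type(query["params"]["query_vector"]).__name__)  # → list
```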

Alternatively, we could allow extra config on each parameter, performing an extra level of JSON decoding on parameters declared as "json"... but I would argue that is a much more expensive solution, whereas the first one has far fewer strings attached, since it simply makes developers fully responsible for getting the template right.
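The alternative could be sketched like this (the config shape is hypothetical, not the plugin's actual API): parameters declared as "json" get one extra decode pass after Mustache has stringified them.

```python
import json

# Hypothetical per-parameter type declarations.
param_types = {"query_vector": "json", "user_query": "string"}

# What Mustache's toJson hands back today: the whole array as one string.
raw_params = {
    "query_vector": "[-0.79067802429199,0.074841000139713]",
    "user_query": "brown fox",
}

# Extra decode pass only for parameters declared as "json".
decoded = {
    name: json.loads(value) if param_types.get(name) == "json" else value
    for name, value in raw_params.items()
}
print(decoded["query_vector"])  # a real list of floats
```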

If anyone deems this an important issue worth fixing and believes in my theory, I would appreciate some help with a workaround, or a pointer to where I can implement my suggested change.

DIXLTICMU commented 3 years ago

Follow-up... I managed to unblock myself by hacking into MustacheUtils.java with some creepy string hacks...

From the looks of the implementation around toJson, it appears that we rely on third-party libraries and will always render a JSON object as a string in the eventual parameter field, instead of allowing it to be an actual JSON element of arbitrary complexity...

The good news is that the fix appears to be fairly self-contained and local... the bad news is that it needs a non-trivial revamp if we want a comprehensive fix.

nathancday commented 3 years ago

Thanks for flagging this issue. I'll craft a reproducible example of the JSON mis-representation that seems to be happening in your dense-vector use case. My gut says you are on the right track with the toJson step being the problem; perhaps we can find an alternative parameter config or library to prevent the string coercion from happening.

manu-dikzit commented 2 years ago

This is affecting "terms" queries used as features as well. I am unable to pass in an array of terms without the {{#toJson}}var_name{{/toJson}} template. Is there any workaround for this? @nathancday

nathancday commented 2 years ago

Thanks for bumping this. I think that is the workaround for now. I'll try to pick this back up next week to re-familiarize myself with the problem, adding your terms case to the dense-vector problem above.