opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Runtime error when using explain=true with multiple script_score neural queries (Null score for the docID: 2147483647) #2176

Open tallytarik opened 1 year ago

tallytarik commented 1 year ago

What is the bug?

When:

- explain=true is set on the search request, and
- the query contains multiple script_score queries, each wrapping a neural query on a different vector field,

then, if a document is returned by some of the neural field queries (within that sub-query's top k) but not by the others, the query fails with a script runtime exception and the error: Null score for the docID: 2147483647

(At least I think this is why... I'm new to OpenSearch and neural search, so apologies - my explanation for why this happens is just my best guess!)

How can one reproduce the bug?

GET /myindex/_search?explain=true
{
  "from": 0,
  "size": 100,
  "query": {
    "bool" : {
      "should" : [
        {
          "script_score": {
            "query": {
              "neural": {
                "title_embedding": {
                  "query_text": "test",
                  "model_id": "xGbq_YcB3ggx1CR0Nfls",
                  "k": 10
                }
              }
            },
            "script": {
              "source": "_score * 1"
            }
          }
        },
        {
          "script_score": {
            "query": {
              "neural": {
                "description_embedding": {
                  "query_text": "test",
                  "model_id": "xGbq_YcB3ggx1CR0Nfls",
                  "k": 10
                }
              }
            },
            "script": {
              "source": "_score * 1"
            }
          }
        }
      ]
    }
  }
}

See an error like:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
          "org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
          "org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
          "_score * 1",
          "^---- HERE"
        ],
        "script": "_score * 1",
        "lang": "painless",
        "position": {
          "offset": 0,
          "start": 0,
          "end": 10
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "opensearch_content",
        "node": "vnyA5s-aQUOmTj6IHosYXA",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.opensearch.knn.index.query.KNNScorer.score(KNNScorer.java:51)",
            "org.opensearch.script.ScoreScript.lambda$setScorer$4(ScoreScript.java:156)",
            "org.opensearch.script.ScoreScript.get_score(ScoreScript.java:168)",
            "_score * 1",
            "^---- HERE"
          ],
          "script": "_score * 1",
          "lang": "painless",
          "position": {
            "offset": 0,
            "start": 0,
            "end": 10
          },
          "caused_by": {
            "type": "runtime_exception",
            "reason": "Null score for the docID: 2147483647"
          }
        }
      }
    ]
  },
  "status": 400
}

Note the high size and low k. You might need to adjust the query_text or k to find a combination where a document is returned in one neural query's top k but not the other's.

Remove explain=true from the query and notice it succeeds.

What is the expected behavior?

The query should succeed when explain=true is set, just as it does without it.

What is your host/environment?

OpenSearch 2.7, Ubuntu 22.04.

Do you have any additional context?

I'm not sure why it only happens with explain=true. (I can't explain it)

It also only happens when using script_score. Using multiple neural queries directly produces no error, but then there is no per-field score in _explanation: the total is correct, but each field's score value is reported as 1. https://github.com/opensearch-project/k-NN/issues/875 describes this problem. My use case: I'd like to try using the similarity score of each field as a feature in a Learning to Rank model, which means I need to get each score individually.

tallytarik commented 1 year ago

Just to add, I'm using nmslib with a field mapping like this:

    "title_embedding": {
      "type": "knn_vector",
      "dimension": 384,
      "method": {
        "name": "hnsw",
        "space_type": "l2",
        "engine": "nmslib",
        "parameters": {
          "ef_construction": 128,
          "m": 24
        }
      }
    }

I've just tested with the Lucene engine and the error does not occur there. (As an aside, with Lucene the _explanation values are all filled in properly, without having to use a script for the neural queries.)

martin-gaievski commented 1 year ago

Explain logic is not really supported in either neural-search or k-NN (which does the work under the hood). In neural-search the explain functionality is not implemented, and k-NN has a mock implementation (in KNNWeight) that returns a constant.

Most probably the error you're seeing is the result of those mock results bubbling up to a higher-level query like bool. While we should investigate the error, explain will most likely not be fixed in the near future.
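
For illustration, a mock explain of that sort could be as simple as a Weight.explain() that always reports a constant, which would line up with the "value": 1 / "No Explanation" entries in the _explanation output quoted later in this thread. This is a hypothetical sketch (the class name and wiring are invented here), not the actual KNNWeight code:

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Explanation;

// Hypothetical sketch only -- not the actual KNNWeight source. An
// explain() stub like this reports the same constant for every doc,
// which matches the "value": 1, "description": "No Explanation"
// entries seen in the _explanation output below.
final class MockExplainWeight {
    Explanation explain(LeafReaderContext context, int doc) {
        return Explanation.match(1.0f, "No Explanation");
    }
}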

tallytarik commented 1 year ago

Thanks @martin-gaievski!

I'm ok if there is no detailed explain logic. My bug report is just about the error thrown when the script references the _score value - it means you can't use explain at all, and therefore can't get the calculated distance per field.

For example, here is what's shown for a successful query (with explain=true but without the bug described above):

{
  "hits": {
    "max_score": 0.81763434,
    "hits": [
      {
        <trimmed>
        "_score": 0.81763434,
        "_source": {},
        "_explanation": {
          "value": 0.81763434,
          "description": "sum of:",
          "details": [
            {
              "value": 0.42870614,
              "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='_score * 1', options={}, params={}}\"",
              "details": [
                {
                  "value": 1,
                  "description": "_score: ",
                  "details": [
                    {
                      "value": 1,
                      "description": "No Explanation",
                      "details": []
                    }
                  ]
                }
              ]
            },
            {
              "value": 0.38892817,
              "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='_score * 1', options={}, params={}}\"",
              "details": [
                {
                  "value": 1,
                  "description": "_score: ",
                  "details": [
                    {
                      "value": 1,
                      "description": "No Explanation",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

I think you're referring to the details showing a constant 1 and "No Explanation". That part is okay - I was already trying to work around it in a different way, by using a script_score query and referring to _score.

So in this example, that works! The important fields are _explanation -> details -> value: 0.42870614 and 0.38892817. These are the individual score values for the two neural queries; we can ignore the constant/"No Explanation" details further down.

But that revealed the bug: referring to _score in a script_score + neural query will sometimes throw an error when explain=true.

The reason I think it's a bug is that it only throws an error when explain=true. Without explain, there is no error, and the document's combined _score is as expected. That doesn't make sense to me: I'd expect the line in KNNScorer::score() that throws the error with explain=true to also throw when calculating the document score. But that doesn't seem to be the case.
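
For what it's worth, 2147483647 is Integer.MAX_VALUE, which Lucene uses as DocIdSetIterator.NO_MORE_DOCS - so the error looks like score() being called while the sub-query's iterator is exhausted, i.e. for a doc that wasn't in that query's top k. Here is a hypothetical sketch of how that class of failure can arise; the map-backed scorer below is my guess at the mechanism, not the actual KNNScorer code:

import java.util.Map;

import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch, not the actual KNNScorer source: a scorer that
// looks up its score in a docId -> score map built from the top-k
// results. If score() is called while the iterator is positioned at
// DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE, i.e. 2147483647),
// the lookup misses and the null check throws - matching the error
// reported above.
final class MapBackedScorer {
    private final Map<Integer, Float> topKScores; // docId -> score for the top k hits
    private final DocIdSetIterator iterator;

    MapBackedScorer(Map<Integer, Float> topKScores, DocIdSetIterator iterator) {
        this.topKScores = topKScores;
        this.iterator = iterator;
    }

    float score() {
        Float score = topKScores.get(iterator.docID());
        if (score == null) {
            throw new RuntimeException("Null score for the docID: " + iterator.docID());
        }
        return score;
    }
}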

So overall, without explain I don't think it's possible to get the distance score of each vector field separately. The best we can get is the document's combined _score.

At least I think that is correct? If there's a different way of running multiple neural queries on multiple fields and getting the score of each one separately, I would love to know!

For now, I'm ok using the Lucene engine, where the bug doesn't occur. (In Lucene, if a doc appears in the top-k of one query but not the other, the non-matching query won't have any entry in _explanation, which is what I expect)

jmazanec15 commented 1 month ago

Transferring to k-NN as it seems it's a k-NN issue.

dblock commented 1 month ago

[Catch All Triage - 1, 2, 3, 4]

jmazanec15 commented 1 month ago

@vamshin could you help add an assignee?