lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0

Error: binaryEmbeddingReader can't be null #6

Closed ltung-cit closed 6 years ago

ltung-cit commented 6 years ago

I'm running Elasticsearch in a Docker container with the binary-vector-scoring plugin installed, but I get an intermittent error when searching with the following query:

{
  "function_score": {
    "boost": 1,
    "score_mode": "avg",
    "boost_mode": "multiply",
    "min_score": 0,
    "script_score": {
      "script": {
        "source": "binary_vector_score",
        "lang": "knn",
        "params": {
          "cosine": true,
          "field": "image_embedding",
          "vector": "MY_VECTOR_HERE"
        }
      }
    }
  }
}

The search runs fine for a while (the first dozen or so requests) and then starts returning the following error:

Caused by: java.lang.IllegalStateException: binaryEmbeddingReader can't be null
elasticsearch    |  at com.liorkn.elasticsearch.script.VectorScoreScript.setBinaryEmbeddingReader(VectorScoreScript.java:67) ~[?:?]
elasticsearch    |  at com.liorkn.elasticsearch.service.VectorScoringScriptEngineService$1.getLeafSearchScript(VectorScoringScriptEngineService.java:65) ~[?:?]
elasticsearch    |  at org.elasticsearch.common.lucene.search.function.ScriptScoreFunction.getLeafScoreFunction(ScriptScoreFunction.java:79) ~[elasticsearch-5.6.0.jar:5.6.0]
elasticsearch    |  at org.elasticsearch.common.lucene.search.function.FunctionScoreQuery$CustomBoostFactorWeight.functionScorer(FunctionScoreQuery.java:140) ~[elasticsearch-5.6.0.jar:5.6.0]
...

Reindexing all documents is the only way to make the search work again. Has anybody faced the same problem?

lior-k commented 6 years ago

This error happens when the field ("image_embedding" in your case) does not exist in every document you are searching.

ghost commented 6 years ago

Same error. I used the field "embedding_vector", and it exists in the document I'm searching.

ltung-cit commented 6 years ago

Hi @lior-k, the field (image_embedding) also exists in my document.

I have an index with 10 shards, and I noticed that even when the search does return hits, the response contains a _shards object that reports failures:

{
  "successful": 3,
  "failed": 7,
  "skipped": 0,
  "total": 10,
  "failures": [
    {
      "node": "ghr7DWYOSWa4tlvZ4kpsFQ",
      "index": "deckito",
      "reason": {
        "reason": "binaryEmbeddingReader can't be null",
        "type": "illegal_state_exception"
      },
      "shard": 0
    }
  ]
}

When I set the number of shards to a low value (below 3), the error occurs more often.
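
Because a search can return hits while some shards fail, it may help to check the _shards section on every response rather than only the hits. The following is a minimal Python sketch of such a check; the function name summarize_shard_failures and the returned tuple layout are illustrative assumptions, not part of the plugin or the Elasticsearch client.

```python
def summarize_shard_failures(shards):
    """Summarize the `_shards` section of a search response.

    Returns (ok, summary, messages): `ok` is True only when no shard failed,
    `summary` is a short counter string, and `messages` has one line per
    reported failure.
    """
    messages = []
    for failure in shards.get("failures", []):
        reason = failure.get("reason", {})
        messages.append(
            "shard {} of index {!r}: {} ({})".format(
                failure.get("shard"),
                failure.get("index"),
                reason.get("reason"),
                reason.get("type"),
            )
        )
    summary = "{}/{} shards succeeded".format(
        shards.get("successful", 0), shards.get("total", 0)
    )
    return shards.get("failed", 0) == 0, summary, messages


# The `_shards` object quoted above, used as sample input.
example_shards = {
    "successful": 3,
    "failed": 7,
    "skipped": 0,
    "total": 10,
    "failures": [
        {
            "node": "ghr7DWYOSWa4tlvZ4kpsFQ",
            "index": "deckito",
            "reason": {
                "reason": "binaryEmbeddingReader can't be null",
                "type": "illegal_state_exception",
            },
            "shard": 0,
        }
    ],
}

ok, summary, messages = summarize_shard_failures(example_shards)
print(ok, summary)
for line in messages:
    print(line)
```

With a live client, the argument would be the "_shards" value of the dictionary returned by the search call.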

nabas commented 6 years ago

I have the same problem: the document has the field, but the error still occurs.

lior-k commented 6 years ago

Please share:

  1. Your index mapping
  2. A query that checks that ALL the documents in the index have the field


ltung-cit commented 6 years ago

Hi @lior-k

This is my mapping:

{
    "settings": {
        "number_of_shards": 10
    },
    "mappings": {
        "slide": {
            "properties": {
                "deck_id": {
                    "type": "keyword",
                    "index": true
                },
                "number": {
                    "type": "integer",
                    "index": true
                },
                "image_embedding": {
                    "type": "binary",
                    "doc_values": true
                },
                "text": {
                    "type": "text",
                    "index": true
                }
            }
        },
        "searchResult": {
            "properties": {
                "deck_id": {
                    "type": "keyword",
                    "index": true
                },
                "search_timestamp": {
                    "type": "date",
                    "index": true
                }
            }
        }
    }
}

My query:

{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "boost": 1,
            "score_mode": "avg",
            "boost_mode": "multiply",
            "min_score": 0,
            "script_score": {
              "script": {
                "source": "binary_vector_score",
                "lang": "knn",
                "params": {
                  "cosine": true,
                  "field": "image_embedding",
                  "vector": "MY_VECTOR"
                }
              }
            }
          }
        }
      ]
    }
  }
}

MY_VECTOR is something like [0.20438875, 0.087035105, 0.41949105, ...]

I'm using the Python client to search only documents of type slide, all of which have the "image_embedding" field:

result = self.client.search(index='deckito', doc_type='slide', from_=0, size=3, body=query, version=True, _source_include=['deck_id', 'number', 'image_embedding'])

lior-k commented 6 years ago

Please run the following query to check that all the documents have a value in this field. It should return 0 documents:

GET <es-url>/<index>/_search
{
    "query": {
        "bool" : {
            "must" : {
                "script" : {
                    "script" : {
                        "inline": "doc.image_embedding == null || doc.image_embedding.value == null || doc.image_embedding.value == ''",
                        "lang": "painless"
                     }
                }
            }
        }
    }
}
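
For other indices, the same check can be built with the field name as a parameter. The sketch below is only an illustration: the helper name build_missing_field_query is hypothetical, and it switches from dot access to bracket access (doc['name']) on the assumption that the field may contain characters such as '-', which Painless would otherwise read as an operator (for example the "embedding-vector" field that appears later in this thread). It does not escape quotes in the field name.

```python
import json


def build_missing_field_query(field):
    """Build the 'documents missing a value' check for an arbitrary field.

    Mirrors the query above, but uses doc['<field>'] so that field names
    containing '-' are not parsed as a subtraction. The field name is
    inserted verbatim, so it must not contain a single quote.
    """
    ref = "doc['{}']".format(field)
    script = "{0} == null || {0}.value == null || {0}.value == ''".format(ref)
    return {
        "query": {
            "bool": {
                "must": {
                    "script": {
                        "script": {"inline": script, "lang": "painless"}
                    }
                }
            }
        }
    }


# Print the request body for a hyphenated field name.
print(json.dumps(build_missing_field_query("embedding-vector"), indent=2))
```

The resulting dictionary can be passed as the body of a search request; as with the query above, a result of 0 hits means every document has a value in the field.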

MannBITS commented 6 years ago

Hi @lior-k

I am also getting the same error:

{
  "took": 33,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 3,
        "index": "indexvectors",
        "node": "Q5VeFkIvQh6KLS6PQsUg2w",
        "reason": {
          "type": "illegal_state_exception",
          "reason": "binaryEmbeddingReader can't be null"
        }
      }
    ]
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

My index mapping and settings look like this:

{
  "indexvectors": {
    "aliases": {},
    "mappings": {
      "vectordocs": {
        "properties": {
          "embedding-vector": {
            "type": "binary",
            "doc_values": true
          },
          "id": {
            "type": "text"
          },
          "vector": {
            "type": "text"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1524853637835",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "76m277CESNiYnovi6n6Q8A",
        "version": {
          "created": "5060099"
        },
        "provided_name": "indexvectors"
      }
    }
  }
}

I have added just one record and used that record's vector field in the query to get KNN with k=1. Ideally the query should have returned the record in the index, but instead I got the error above. Could you help me out here?

ltung-cit commented 6 years ago

Hi @lior-k

I ran the query you posted in 3 different ways and it returned the following results (note I have 2 document types: slide and searchResult and the property image_embedding is only declared for type slide):

MannBITS commented 6 years ago

I was able to get the issue resolved by following lior-k's suggestion and making sure that 0 docs are returned for the query mentioned. I am able to get the KNN docs now using the plugin. Thanks @lior-k :-)

ghost commented 6 years ago

I fixed my templates and reindexed, and it finally works. Before the fix, the field name in my templates differed from the one in my documents; they must be the same. I had also defined the "embedding_vector" field as "text", but it should be "binary".
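
An Elasticsearch "binary" field stores its value as a Base64-encoded string, so the embedding has to be packed into bytes before indexing. The sketch below shows one way to do that in Python. The byte layout (big-endian 32-bit floats) is an assumption that should be checked against the plugin's README for the version in use, and the helper names encode_embedding and decode_embedding are hypothetical.

```python
import base64
import struct


def encode_embedding(values):
    """Pack floats as big-endian float32 (assumed layout) and Base64-encode them."""
    packed = struct.pack(">{}f".format(len(values)), *values)
    return base64.b64encode(packed).decode("ascii")


def decode_embedding(encoded):
    """Inverse of encode_embedding: Base64 string back to a list of floats."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(">{}f".format(len(raw) // 4), raw))


# Round trip with values that are exactly representable as float32.
encoded = encode_embedding([1.0, 0.5, -2.0])
print(encoded)
print(decode_embedding(encoded))
```

Decoding a stored value this way is also a quick sanity check that a document's field really contains a vector of the expected length.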

lior-k commented 6 years ago

good to hear, closing the issue

tgreiser commented 5 years ago

Also struggling with this problem. The plugin works in production, but when I use elasticdump to copy the data to a local server I start getting "binaryEmbeddingReader can't be null".

elasticdump --input=./account_mapping.json --output=http://localhost:9200/account --type=mapping
elasticdump --input=./account.json --output=http://localhost:9200/account --type=data

In this state my vector searches fail entirely. If I inspect the mapping, my field is mapped correctly. If I use the painless query above, I find 0 records. If I reindex my documents, things start working on most of the shards.

POST http://localhost:9200/_reindex
{
  "source": {
    "index": "account"
  },
  "dest": {
    "index": "tmp"
  }
}

Then I do a second _reindex to rename tmp back to account. My queries start working now; however, I still see exceptions firing in the ES server, and the _shards section of my query response shows 3 successful and 2 failed shards:

"_shards": {
        "total": 5,
        "successful": 3,
        "skipped": 0,
        "failed": 2,
        "failures": [
            {
                "shard": 0,
                "index": "account",
                "node": "HlfEVuX_TbO8u6GXu47REQ",
                "reason": {
                    "type": "illegal_state_exception",
                    "reason": "binaryEmbeddingReader can't be null"
                }
            }
        ]
    },

Update: After about 15 minutes and a few reboots, the two buggy shards started working and I now get 5/5 successful. So if anyone else has the same problem: import, reindex, and then wait a while for the shards to rebuild.
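
The waiting step can be automated by repeating a search until no shard reports a failure. This is a minimal sketch under stated assumptions: wait_for_all_shards and its parameters are hypothetical, and run_search stands for any callable that performs the vector query and returns the response as a dictionary.

```python
import time


def wait_for_all_shards(run_search, attempts=10, delay_seconds=30.0, sleep=time.sleep):
    """Call run_search() until its `_shards` section reports no failures.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt still had failed shards.
    """
    for attempt in range(1, attempts + 1):
        shards = run_search().get("_shards", {})
        if shards.get("total", 0) > 0 and shards.get("failed", 0) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give the shards time before retrying
    return None


# Simulated responses: the first has 2 failed shards, the second has none.
fake_responses = iter([
    {"_shards": {"total": 5, "successful": 3, "failed": 2}},
    {"_shards": {"total": 5, "successful": 5, "failed": 0}},
])
print(wait_for_all_shards(lambda: next(fake_responses), delay_seconds=0))
```

The sleep function is injectable only so the loop can be exercised without real delays; in practice the default time.sleep would be used.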