opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] OS updates wiping knn_vector field when excluded from _source #1694

Closed claire-chiu-figma closed 1 month ago

claire-chiu-figma commented 4 months ago

**What is the bug?**
I have an index with a knn_vector field that I excluded from _source. When I update a document in this index without specifying any value for the knn_vector field, the field gets wiped from the document (when I would expect the field to remain unchanged).

**How can one reproduce the bug?**

1. Go to OpenSearch Dashboards.
2. Create an OpenSearch index with a knn_vector field:

    PUT /test
    {
      "settings": {
        "index": {
          "replication": {
            "type": "DOCUMENT"
          },
          "knn": "true"
        }
      },
      "mappings": {
        "dynamic": "strict",
        "_source": {
          "excludes": [
            "embedding"
          ]
        },
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1,
            "method": {
              "engine": "faiss",
              "space_type": "l2",
              "name": "hnsw",
              "parameters": {}
            }
          },
          "creator_id": {
            "type": "keyword"
          },
          "file_id": {
            "type": "long"
          }
        }
      }
    }

3. Create a document in the index:

    PUT /test/_doc/1
    {
      "creator_id": "2",
      "file_id": 22,
      "embedding": [0.1]
    }

4. Check to see if the knn_vector field (embedding) exists on the document (the command below should return the single document):

    GET /test/_search
    {
      "query": {
        "bool": {
          "must": [
            { "exists": { "field": "embedding" } }
          ]
        }
      }
    }

5. Update the document, changing only a non-vector field:

    POST /test/_update/1
    {
      "doc": {
        "file_id": 24
      }
    }


6. Check whether the document still has the knn_vector field by running the same query as in step 4. It now returns no documents.

**What is the expected behavior?**
I would expect the knn_vector field to still exist after the update, because I have not made any changes to that field.

**What is your host/environment?**
OpenSearch version 2.11, hosted on AWS

**Do you have any additional context?**
In an index where the knn_vector field (embedding) is NOT excluded from _source, this problem is not present.

navneet1v commented 4 months ago

@claire-chiu-figma Because you are not storing the vector in _source, updates will remove it. This is not a bug; it is the expected consequence of excluding a field from _source.

Your other experiment shows the same thing: in an index where the knn_vector field (embedding) is NOT excluded from _source, this problem is not present. When the field is kept in _source, it is not removed.

navneet1v commented 4 months ago

We are working on another feature (ref: https://github.com/opensearch-project/k-NN/pull/1571) with which the vector field should not be removed even if you exclude it from _source and then run updates.

@luyuncheng, as you are the author of the PR, can you validate that excluding the vector from _source and then doing updates will work once your code is merged?

claire-chiu-figma commented 4 months ago

@navneet1v Thanks for the quick response -

> This is not a bug; it is the expected consequence of excluding a field from _source.

To further understand the implications of removing a field from _source - does this mean that for ANY field that is excluded from _source, when you run an update on the document, if the update does not specify a new value for that field, that field will get wiped? Why is that so?

navneet1v commented 4 months ago

> if the update does not specify a new value for that field, that field will get wiped? Why is that so?

The reason is that OpenSearch has no way to recreate the whole document from scratch. The whole document is stored in _source, so if you remove it, the update capability goes away.
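A toy model may make this concrete. The sketch below is not OpenSearch code; it only assumes, as described above, that a partial update rebuilds the document from the stored _source, merges in the changes, and reindexes the result. The function names `index_document` and `partial_update` are hypothetical.

```python
from copy import deepcopy


def index_document(doc, source_excludes):
    """Toy indexing step: every field is indexed, but only the
    non-excluded fields are kept in the stored _source."""
    return {
        "indexed_fields": deepcopy(doc),
        "_source": {
            name: deepcopy(value)
            for name, value in doc.items()
            if name not in source_excludes
        },
    }


def partial_update(stored, partial_doc, source_excludes):
    """Toy partial update: start from _source, merge the changes,
    then reindex the merged document as a whole."""
    merged = {**stored["_source"], **partial_doc}
    return index_document(merged, source_excludes)


excludes = {"embedding"}
stored = index_document(
    {"creator_id": "2", "file_id": 22, "embedding": [0.1]}, excludes
)
updated = partial_update(stored, {"file_id": 24}, excludes)

print("embedding" in stored["indexed_fields"])   # True
print("embedding" in updated["indexed_fields"])  # False
```

Because `embedding` never reaches `_source`, the merged document that gets reindexed no longer contains it, which matches the behavior reported in this issue.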

claire-chiu-figma commented 4 months ago

I see, and https://github.com/opensearch-project/k-NN/pull/1571 would help resolve this issue even if the vector is excluded from _source, because it would store the vector in docvalue_fields, which can be pulled from during an update operation?

navneet1v commented 4 months ago

> I see, and #1571 would help resolve this issue even if the vector is excluded from _source, because it would store the vector in docvalue_fields, which can be pulled from during an update operation?

Vectors are already stored in doc_values. What the above PR will do is ensure that vectors are pulled from doc values if they are not present in _source. I would like @luyuncheng to comment more, as he is the author of the PR.
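The fallback idea described here can be pictured with a small sketch. This is only an illustration of "fill in the vector from doc values when _source lacks it", not the code in the PR; the function name `synthesize_source` and the plain-dict representation of _source and doc values are assumptions made for the example.

```python
def synthesize_source(stored_source, doc_values, vector_fields):
    """Return a copy of stored_source in which any vector field missing
    from _source is filled in from doc_values (hypothetical helper)."""
    result = dict(stored_source)
    for field in vector_fields:
        if field not in result and field in doc_values:
            result[field] = list(doc_values[field])
    return result


stored_source = {"creator_id": "2", "file_id": 22}  # embedding was excluded
doc_values = {"embedding": [0.1]}                   # vector kept in doc values

full = synthesize_source(stored_source, doc_values, ["embedding"])
print(full)  # {'creator_id': '2', 'file_id': 22, 'embedding': [0.1]}
```

Whether such a synthesized source is also used on the update path is exactly the open question put to the PR author in this thread.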

claire-chiu-figma commented 4 months ago

@navneet1v apart from the PR that's being worked on, are there any other approaches to issuing partial updates without wiping the vector? Or is adding this back to _source the only option?

navneet1v commented 4 months ago

@claire-chiu-figma The only way is to recreate the whole source yourself and send it in your update API call. Otherwise, you have to enable _source.
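One way to apply this on the client side is to resend the excluded vector with every partial update. The sketch below only builds the request; it assumes the caller can fetch the current vector from its own system of record, and the helper name `build_update_request` is hypothetical. The `POST /<index>/_update/<id>` path and the `doc` body follow the update call used in the reproduction steps.

```python
import json


def build_update_request(index, doc_id, changes, excluded_fields):
    """Build a partial-update request that also resends the fields that
    are excluded from _source (hypothetical helper).

    excluded_fields maps each excluded field to its current value.
    Values in changes take precedence if a field appears in both.
    """
    doc = {**excluded_fields, **changes}
    return {
        "method": "POST",
        "path": f"/{index}/_update/{doc_id}",
        "body": {"doc": doc},
    }


request = build_update_request(
    "test", "1", changes={"file_id": 24}, excluded_fields={"embedding": [0.1]}
)
print(request["method"], request["path"])
print(json.dumps(request["body"]))
```

The resulting body contains both `file_id` and `embedding`, so the update supplies the vector again instead of relying on _source to provide it.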

navneet1v commented 4 months ago

@claire-chiu-figma Can I go ahead and close this issue? There is no bug; this is the expected behavior of OpenSearch.

luyuncheng commented 4 months ago

@navneet1v @claire-chiu-figma

When a field is excluded from _source and you run an update operation, it goes through this logic:

https://github.com/opensearch-project/OpenSearch/blob/14f1c43c108f378b13d109ade364216c082fb858/server/src/main/java/org/opensearch/index/engine/InternalEngine.java#L1311-L1318

It uses the source stored in Lucene to do the update. As far as I know, the original reference contains a warning that when fields are excluded from _source, the update, update_by_query, and reindex APIs cannot be used.

And if we want to use the #1571 feature, which rewrites the FetchSubPhase, it can support reindex but not updates to other fields.

There are two scenarios:

  1. Exclude vector, update vector field: OK
  2. Exclude vector, update other field: Failed

jmazanec15 commented 1 month ago

Covered in #1572