opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[FEATURE] Reuse KNNVectorFieldData to reduce disk usage #1572

Open luyuncheng opened 8 months ago

luyuncheng commented 8 months ago

Description

In some scenarios, we want to reduce disk usage and I/O throughput for the _source field. To do that, we exclude the k-NN fields from _source in the mapping, as shown below. The drawback is that the k-NN fields can then no longer be retrieved or rebuilt.

"mappings": {
  "_source": {
    "excludes": [
      "target_field1",
      "target_field2"
    ]
  }
}

So I propose using doc values to return the vector fields, like this:

POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
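As a minimal client-side sketch of the request above, the body can be built as a plain Python dict and serialized with the standard json module. The helper name build_docvalue_search is hypothetical and only illustrates the shape of the request; it is not part of the k-NN plugin or any OpenSearch client.

```python
import json


def build_docvalue_search(vector_fields):
    """Build a search body that asks for vectors via doc values and skips _source.

    `build_docvalue_search` is a hypothetical helper used only for illustration.
    """
    return {
        "docvalue_fields": list(vector_fields),
        "_source": False,
    }


body = build_docvalue_search(["vector_field1", "vector_field2"])
# json.dumps always emits strict JSON, so no trailing commas can slip in.
payload = json.dumps(body, indent=2)
print(payload)
```

Serializing through json.dumps instead of hand-writing the body avoids malformed requests such as trailing commas in the field list.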

Proposal

  1. Rewrite KNNVectorDVLeafFieldData to read data from doc values

    I rewrote KNNVectorDVLeafFieldData so that KNN80BinaryDocValues can return the requested k-NN docvalue_fields, for example (vector_field1 is a k-NN field):

"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]

Optimization result on the 1M SIFT dataset with 1 shard: 1389 MB with _source stored versus 1055 MB without it (-24%).
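The 24% figure follows directly from the two reported index sizes. A short check of the arithmetic, using only the numbers quoted above (the function name disk_savings is made up for this sketch):

```python
def disk_savings(with_source_mb, without_source_mb):
    """Return (absolute MB saved, fraction saved relative to the larger index)."""
    saved = with_source_mb - without_source_mb
    return saved, saved / with_source_mb


# Sizes reported in this issue for the 1M SIFT dataset on 1 shard.
saved_mb, fraction = disk_savings(1389, 1055)
print(f"saved {saved_mb} MB ({fraction:.1%})")  # saved 334 MB (24.0%)
```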

Looking further into the k-NN doc values fields: when the faiss engine is used, I think we can use the reconstruct_n interface to retrieve the specific doc values, which would save the disk space used by BinaryDocValuesFormat. Alternatively, as in this issue comment, we could redesign a KnnVectorsFormat.

  2. Composite the vector fields into _source

I added KNNFetchSubPhase, with a processor similar to FetchSourcePhase#FetchSubPhaseProcessor, that merges the docvalue_fields into _source, similar to synthetic source logic.
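To illustrate the merge idea only, here is a rough client-side sketch in Python. It is not the KNNFetchSubPhase implementation from the PR (which is Java and runs inside the fetch phase); the function name merge_docvalues_into_source and the choice to convert the string values to floats are assumptions made for this example.

```python
def merge_docvalues_into_source(hit, vector_fields):
    """Return a copy of `hit` whose _source also carries the doc-value vectors.

    Hypothetical illustration of the "synthetic source" merge; fields that
    are already present in _source are left untouched.
    """
    source = dict(hit.get("_source") or {})
    fields = hit.get("fields", {})
    for name in vector_fields:
        if name in fields and name not in source:
            # Doc values in the sample response are strings, e.g. ["1.5", "2.5"].
            source[name] = [float(v) for v in fields[name]]
    merged = dict(hit)
    merged["_source"] = source
    return merged


# A hit shaped like the sample response in this issue.
hit = {"_index": "test", "_id": "1", "_score": 1.0,
       "fields": {"vector_field1": ["1.5", "2.5"]}}
merged = merge_docvalues_into_source(hit, ["vector_field1"])
print(merged["_source"])  # {'vector_field1': [1.5, 2.5]}
```

The original hit is not modified, so the raw "fields" section stays available alongside the rebuilt _source.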

Do you have any additional context?

This was discussed in issue #1087, where there are some other ideas. My PR is #1571.

As a further step for the k-NN doc values fields: when the faiss engine is used, I think we can use the reconstruct_n interface to retrieve the specific doc values and save the disk space used by BinaryDocValuesFormat. Or, as in #1087, we can use KnnVectorsFormat.

However, the idea I want to show here is simply to reduce disk usage (https://github.com/opensearch-project/OpenSearch/issues/6356 discusses this) while keeping, as far as possible, the source that reindex needs. I think PR #1571 does just that: it reduces disk usage and keeps the source in a synthetic way.

jmazanec15 commented 6 months ago

I think we are going to need to push this to 2.15.