Description
In some scenarios we want to reduce disk usage and I/O throughput for the `_source` field, so we exclude the k-NN fields from `_source` in the mapping. The drawback is that an excluded k-NN field can no longer be retrieved or rebuilt. I therefore propose serving the vector fields from their doc_values instead.
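To make the starting point concrete, here is a minimal sketch of an index body that keeps a vector field out of `_source`. It only builds the JSON; the helper name `build_index_body` and the dimension of 128 are assumptions for illustration and are not taken from the PR.

```python
import json


def build_index_body(vector_field: str, dimension: int) -> dict:
    """Build an index-creation body that excludes the vector from _source.

    The field is still mapped as knn_vector, so it stays searchable, but its
    raw value is no longer stored inside _source.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            # Excluding the field is what saves the _source disk space.
            "_source": {"excludes": [vector_field]},
            "properties": {
                vector_field: {"type": "knn_vector", "dimension": dimension},
            },
        },
    }


if __name__ == "__main__":
    # Illustrative values only.
    print(json.dumps(build_index_body("vector_field1", 128), indent=2))
```

With a mapping like this the vector is absent from `_source`, which is exactly the case where it cannot be retrieved or reindexed without another way to read it back.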
Proposal
Rewrite `KNNVectorDVLeafFieldData` to get data from doc values

I rewrote `KNNVectorDVLeafFieldData` so that `KNN80BinaryDocValues` can return the requested k-NN `docvalue_fields`, for example for `vector_field1` when it is a k-NN field type.

Optimization result (1M SIFT dataset, 1 shard): 1389 MB with `_source` stored versus 1055 MB without it (-24%).
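As a rough illustration of how a client would ask for the vector through doc values, the sketch below builds a search body with `docvalue_fields` and reads the value back from a hit's `fields` section. The helper names are hypothetical, and the exact shape of the returned vector value is an assumption, not something confirmed by the PR.

```python
from typing import Any, Dict, Optional


def build_search_body(vector_field: str, size: int = 10) -> Dict[str, Any]:
    """Search body that requests the vector via docvalue_fields."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "docvalue_fields": [vector_field],
    }


def extract_vector(hit: Dict[str, Any], vector_field: str) -> Optional[Any]:
    """Read a doc-value field from a search hit.

    Doc-value fields are returned under the hit's "fields" key. Returns None
    when the hit carries no value for the field.
    """
    return hit.get("fields", {}).get(vector_field)
```

The point of the design is that the client no longer depends on `_source` for the vector, so `_source` can drop it without losing the ability to fetch it.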
Going further with the k-NN doc values fields: when the faiss engine is used, I think we could use the `reconstruct_n` interface to retrieve the specific doc values and save the disk space used by `BinaryDocValuesFormat`. Alternatively, as in the issue comments, a `KnnVectorsFormat` could be redesigned.
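The `reconstruct_n` idea can be pictured with a toy stand-in. faiss indexes expose `reconstruct_n(i0, ni)`, which returns the `ni` stored vectors starting at internal id `i0`. The class below is not faiss and not plugin code; `FlatVectorStore` is a hypothetical in-memory store that only mirrors that call shape, to show why a second copy of the vectors would be unnecessary if the index itself can hand them back.

```python
from array import array
from typing import List


class FlatVectorStore:
    """Toy flat store: vectors live once, in one contiguous float32 buffer."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self._data = array("f")  # single copy of all vectors

    @property
    def ntotal(self) -> int:
        """Number of vectors currently stored."""
        return len(self._data) // self.dimension

    def add(self, vector: List[float]) -> int:
        """Append one vector and return its internal id."""
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        self._data.extend(vector)
        return self.ntotal - 1

    def reconstruct_n(self, i0: int, ni: int) -> List[List[float]]:
        """Return the ni vectors with internal ids i0 .. i0 + ni - 1."""
        if i0 < 0 or ni < 0 or i0 + ni > self.ntotal:
            raise IndexError("requested id range is out of bounds")
        d = self.dimension
        return [list(self._data[(i0 + k) * d:(i0 + k + 1) * d]) for k in range(ni)]
```

Note that in a real index the result is only exact when the index stores full vectors; quantized indexes reconstruct an approximation, so this approach would need checking per index type.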
Composite vector field to _source

I added `KNNFetchSubPhase` with a processor, similar to `FetchSourcePhase#FetchSubPhaseProcessor`, that combines the `docvalue_fields` into `_source`, along the lines of synthetic source logic.

Do you have any additional context?

This was discussed in issue #1087, where there are some other ideas. My PR is #1571.
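The merge step that `KNNFetchSubPhase` is described as doing can be sketched outside the plugin as a plain function over a search hit. This is only an illustration of the idea, not the PR's Java code: the function name is hypothetical, and it assumes the doc-value entry for a vector field is the vector itself.

```python
import copy
from typing import Any, Dict


def merge_docvalues_into_source(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Return a _source-like dict with doc-value fields folded in.

    Fields already present in _source win, so stored values are never
    overwritten. The input hit is left unmodified.
    """
    merged = copy.deepcopy(hit.get("_source") or {})
    for name, values in hit.get("fields", {}).items():
        if name not in merged:
            merged[name] = values
    return merged
```

For example, a hit with `_source` `{"title": "a"}` and `fields` `{"vector_field1": [1.0, 2.0]}` yields a merged document containing both keys, which is the shape a reindex would need.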
Going further with the k-NN doc values fields: when the faiss engine is used, I think we can use the `reconstruct_n` interface to retrieve the specific doc values and save the disk space used by `BinaryDocValuesFormat`, or, as in #1087, we can use `KnnVectorsFormat`.

But the idea I want to show here is just to reduce disk usage (https://github.com/opensearch-project/OpenSearch/issues/6356 discusses this) while keeping, as far as possible, the source that reindex needs. I think PR #1571 does just that: it reduces disk usage and keeps the source in a synthetic way.