Closed jmazanec15 closed 1 month ago
When we added KNN support using native engine libraries (Faiss, NMSLIB), there was no KnnVectorsFormat available in the Lucene library. So we used BinaryDocValuesFormat to store the vector data and built the native engine index while indexing the doc values. For searching, we implemented our own query that utilizes the native engine index for ANN and the doc value data for exact search.
Now that KnnVectorsFormat is available in the Lucene library, we are exploring an opportunity to move from DocValuesFormat to KnnVectorsFormat to see if there are any benefits to doing so.
With KnnVectorsFormat, we cannot simply store the vector data and add a native engine index alongside it as we did with DocValuesFormat, because that would also create a Lucene vector index that serves no purpose and only consumes resources. Therefore, we need to override the entire implementation by extending the KnnVectorsFormat class for both indexing and querying.
After comparing the pros and cons of migration, we want to hold off on it until we have more data points to support migration, given the effort it requires.
A few points to note:
Indexing: When we use the Lucene engine, we store the vector data using either KnnByteVectorField or KnnFloatVectorField. If doc values are enabled, we store the vector data in binary doc value format as well. Then, in the codec, we delegate both the binary doc values and the knn vector format to the Lucene library to handle. For native libraries, we always store the vector data in binary doc value format. Then, in the codec, we delegate the binary doc value processing to the Lucene library. After that, if the field is a knn field, we run an additional process to index the values using the native library (Faiss or NMSLIB). The native engine also uses doc values during merge.
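The native-engine flow above (delegate storage to Lucene, then build the native index for knn fields) can be pictured with a minimal sketch. All types here are stand-ins for illustration, not the actual Lucene or k-NN plugin classes, and the knn-field predicate is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;

public class NativeDelegationSketch {
    // Stand-in for Lucene's doc values consumer API.
    interface DocValuesWriter {
        void addBinaryField(String field, byte[] value);
    }

    /** Records which fields were written; stands in for Lucene's real writer. */
    static class LuceneDocValuesWriter implements DocValuesWriter {
        final List<String> written = new ArrayList<>();
        public void addBinaryField(String field, byte[] value) {
            written.add(field);
        }
    }

    /** Delegates storage to Lucene first, then triggers a native (Faiss/NMSLIB) index build for knn fields. */
    static class NativeEngineDocValuesWriter implements DocValuesWriter {
        final DocValuesWriter delegate;
        final List<String> nativeIndexedFields = new ArrayList<>();

        NativeEngineDocValuesWriter(DocValuesWriter delegate) {
            this.delegate = delegate;
        }

        public void addBinaryField(String field, byte[] value) {
            delegate.addBinaryField(field, value);  // Lucene handles on-disk storage
            if (isKnnField(field)) {
                nativeIndexedFields.add(field);     // stand-in for the JNI index build step
            }
        }

        static boolean isKnnField(String field) {
            return field.endsWith("_vector");       // placeholder predicate, not the real mapping check
        }
    }
}
```

The point of the decorator shape is that Lucene remains the source of truth for storage; the native index build is strictly additive.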
Searching: For the Lucene engine, we use the existing KnnByteVectorQuery and KnnFloatVectorQuery. Both of them extend the AbstractKnnVectorQuery class, which has common methods to support vector queries. By returning the query class, the rest of the process is handled by OpenSearch. The Lucene engine switches to exact search based on internal logic, and it does not need doc values to be enabled to do the exact search.
For the native engine, we have our own KNNQuery class. KNNQuery has its own logic to switch to exact search. For exact search to work, doc values are needed; ANN search works without doc values. Native engine libraries support vector data access by id, but we need to test how efficient that is compared to doc values.
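The native engine's switch between ANN and exact search described above can be sketched as a simple decision rule. This is a simplified model, not the actual KNNQuery logic; the threshold and parameter names are hypothetical:

```java
public class SearchModeSketch {
    /**
     * Simplified decision: fall back to exact search when a filter leaves fewer
     * candidates than k, or when no native index segment is available. The exact
     * path requires doc values today, since it reads vectors from BinaryDocValues.
     */
    static boolean useExactSearch(int filteredDocCount, int k,
                                  boolean nativeIndexPresent, boolean docValuesEnabled) {
        if (!docValuesEnabled) {
            return false;  // exact path has no vector source without doc values
        }
        return !nativeIndexPresent || filteredDocCount <= k;
    }
}
```

This also illustrates the coupling the issue calls out: disabling doc values removes the exact-search fallback entirely for the native engine, while ANN keeps working.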
Compound file handling: There is KNN80CompoundFormat. Its role is to move the files generated by the native engine from the original compound file to a new compound file with a different extension, to avoid header/footer checks. This could be removed entirely if we migrate to a new format with our own codec with a custom reader and writer.
Scoring script: The scoring script runs on doc values for both the Lucene engine and the native engine. This part of the code will remain the same.
Others: KNN stats and model training will remain the same.
We need to create a FieldType using the setVectorAttributes method of FieldType, similar to KnnFloatVectorField and KnnByteVectorField, so that the type is treated as a vector type. To create doc values, the code can be shared between KNNVectorFieldMapper and LuceneFieldMapper:
FieldType type = new FieldType();
type.setVectorAttributes(dimension, VectorEncoding.BYTE, similarityFunction);
// Add more attributes
type.freeze();
(Before/after diagrams)
We might either extend AbstractKnnVectorQuery or just use KnnFloatVectorQuery for the native engine. The following code is from KnnFloatVectorQuery. If we implement our reader properly, this query could be used as-is.
protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException {
    TopDocs results = context.reader().searchNearestVectors(this.field, this.target, this.k, acceptDocs, visitedLimit);
    return results != null ? results : NO_RESULTS;
}

VectorScorer createVectorScorer(LeafReaderContext context, FieldInfo fi) throws IOException {
    return fi.getVectorEncoding() != VectorEncoding.FLOAT32 ? null : VectorScorer.create(context, fi, this.target);
}
However, if we want to migrate the exact search implementation later, we can extend AbstractKnnVectorQuery and override its methods. The createVectorScorer method is used only inside exactSearch; therefore, if we override exactSearch we don't need to implement createVectorScorer.
protected abstract TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException;
protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator acceptIterator)
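The relationship between these two hooks can be sketched with stand-in types. This is a simplified model of AbstractKnnVectorQuery's template method, not its real code: the base class runs the approximate pass and falls back to the exact pass when it is inconclusive, so a subclass that overrides exactSearch never touches createVectorScorer:

```java
public class QueryTemplateSketch {
    /** Stand-in for AbstractKnnVectorQuery's per-leaf template method. */
    abstract static class AbstractVectorQuery {
        final int[] search() {
            int[] approx = approximateSearch();
            // null models "approximate pass inconclusive": fall back to exact search
            return approx != null ? approx : exactSearch();
        }
        abstract int[] approximateSearch();
        abstract int[] exactSearch();
    }

    /** Hypothetical native-engine subclass: ANN via JNI, exact via doc values (both faked here). */
    static class NativeVectorQuery extends AbstractVectorQuery {
        final boolean nativeIndexAvailable;

        NativeVectorQuery(boolean nativeIndexAvailable) {
            this.nativeIndexAvailable = nativeIndexAvailable;
        }

        int[] approximateSearch() {
            // returning null triggers the base class's exact-search fallback
            return nativeIndexAvailable ? new int[] {3, 1, 4} : null;
        }

        int[] exactSearch() {
            return new int[] {2, 7};  // brute-force stand-in
        }
    }
}
```

Because the fallback wiring lives in the base class, overriding both hooks is the whole integration surface for the native engine.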
(Before/after diagrams)
We need to define our own NativeEngineVectorsFormat extending KnnVectorsFormat. In BasePerFieldKnnVectorsFormat, we return the appropriate vector format based on engine type: NativeEngineVectorsFormat or Lucene95HnswVectorsFormat. Here, we also need to implement a KnnVectorsWriter and KnnVectorsReader for the native engine.
KnnVectorsReader
public abstract void checkIntegrity() throws IOException;
public abstract FloatVectorValues getFloatVectorValues(String field) throws IOException;
public abstract ByteVectorValues getByteVectorValues(String field) throws IOException;
public abstract TopDocs search(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;
public abstract TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;
KnnVectorsWriter
public abstract KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException;
public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
public abstract void finish() throws IOException;
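The per-field dispatch described above can be sketched as follows. This is a simplified stand-in for BasePerFieldKnnVectorsFormat (which returns KnnVectorsFormat instances, not names), and NativeEngineVectorsFormat is the hypothetical format proposed in this issue:

```java
public class PerFieldFormatSketch {
    enum Engine { FAISS, NMSLIB, LUCENE }

    /**
     * Stand-in for the per-field format lookup: native engines get the
     * proposed NativeEngineVectorsFormat, while the Lucene engine keeps
     * Lucene's own HNSW format.
     */
    static String formatFor(Engine engine) {
        switch (engine) {
            case FAISS:
            case NMSLIB:
                return "NativeEngineVectorsFormat";
            default:
                return "Lucene95HnswVectorsFormat";
        }
    }
}
```

The design point is that the codec no longer special-cases doc values for native engines; the engine type alone selects a KnnVectorsFormat, and the writer/reader pair behind it owns serialization.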
(Before/after diagrams)
Will close for now as investigation is complete. We may revisit this topic later.
I'm going to keep this issue open. I think we need to do this work at some point. A couple of issues where it could potentially be helpful:
Looking at the cons: (1) the same thing goes for doc values.
(2) and (3) shouldn't be listed. It's possible to just disable doc values for this field and use the vectors directly. Also, for (2), we could implement a wrapper around the native engine storage of vectors and not store by default.
(4) I don't think this will be the case (after our custom codec is deprecated).
(6) I think this may be true, but in the long run we will be able to reduce effort.
Additionally, our codec has not been updated in a while, and the versioning is not scalable. This might help make it more maintainable.
hi @jmazanec15 @navneet1v @heemin32, I have an idea about this topic. In some scenarios, we want to reduce the disk usage and IO throughput for the source field. So we would exclude knn fields in the mapping so they are not stored in _source (this means the knn fields cannot be retrieved or rebuilt):
"mappings": {
"_source": {
"excludes": [
"target_field1",
"target_field2",
]
}
}
So I propose using docvalue_fields for the vector fields, like:
POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
To support this, I rewrote KNNVectorDVLeafFieldData and made KNN80BinaryDocValues able to return the specific knn docvalue_fields, like (vector_field1 is a knn field type):
"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]
Optimization result: 1M SIFT dataset, 1 shard. With source stored: 1389 MB; without source stored: 1055 MB (-24%).
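As a quick sanity check on the reported saving, the percentage follows directly from the two sizes:

```java
public class SavingsCheck {
    /** Percent reduction from `before` to `after`, rounded to the nearest integer. */
    static long percentSaved(long before, long after) {
        return Math.round(100.0 * (before - after) / before);
    }
}
```

With before = 1389 MB and after = 1055 MB, this gives 24%, matching the figure above.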
Diving further into knn doc value fields: I think when using the faiss engine, we can use the reconstruct_n interface to retrieve the specific doc values and save the disk usage for BinaryDocValuesFormat. Or, as the comments in this issue suggest, redesign a KnnVectorsFormat.
I think I can create a new issue and PR for the code, and report the details.
@luyuncheng not storing vector fields in _source is always an option for users. But this comes with its own limitations:
POST some_index/_search { "docvalue_fields": [ "vector_field1", "vector_field2" ], "_source": false }
If we can support this kind of retrieval, it would be really awesome, because users who don't have use cases for re-indexing or update-by-query can still use it.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
If you remove the fields from _source then you cannot do reindexing and update-by-query. For more details you can refer to this doc: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/mapping-source-field.html#:~:text=Think%20before%20disabling%20the%20_source%20field
@navneet1v if we support synthetic source in the future, this would not be a problem.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
@navneet1v I will create a PR to show how I rewrote KNNVectorDVLeafFieldData and show some details. BTW, I am writing the tests for it.
If you remove the fields from _source then you cannot do reindexing and update-by-query.
@navneet1v if we support synthetic source in the future, this would not be a problem.
Yeah, that's right, but we don't. I am not particularly against removing the vector field from _source, as it's more index-mapping driven. It's just that we should be aware of the pitfalls.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
@navneet1v I will create a PR to show how I rewrote KNNVectorDVLeafFieldData and show some details. BTW, I am writing the tests for it.
Can you also explore the sorted doc values? I would like to ensure that we are not over-engineering a solution.
@luyuncheng In terms of source storage, there is an issue over here: https://github.com/opensearch-project/OpenSearch/issues/6356
In terms of source storage, there is an issue over here: https://github.com/opensearch-project/OpenSearch/issues/6356
@jmazanec15 @navneet1v In #1571 I created a PR, which is WIP, and it shows how my idea works.
KNNVectorDVLeafFieldData can return the vector doc value fields the same way the script does: _source is replaced by the doc values fields response in the first step. This works somewhat like synthetic source, but the fields need to be explicitly requested in the search body via docvalue_fields.
@jmazanec15 , @heemin32 , @luyuncheng I created a POC for Moving to KNNFloatVectorValues: https://github.com/navneet1v/k-NN/commit/990a58f58df1519fa12306466c8f8a9f175ab13a
Things which are working:
Things not working:
Will work on fixing the things not working in the POC.
This code alone will not work. We need changes in OpenSearch and Lucene. The Lucene PR is raised.
Training index creation is not optimal.
What do you mean by this?
So right now, if someone is creating a training index, they can set index.knn: false to ensure graphs are not created for the index, since it is just a training index, and our custom codec is not picked.
But as of now, the way I implemented it, from the KNNFieldMapper standpoint I am migrating all indices greater than a specific version to use FloatVectorField. Now this FloatVectorField will by default create the k-NN graphs (this is on the Lucene side). Hence I was saying it is not optimal.
I was trying to add more complexity to the logic of when to pick LuceneVectorField vs. our KNNVectorField, but then the exact search use cases will not be optimal, because exact search will keep using BinaryDocValues rather than KNNFloatVectorsValues.
I am thinking we should add a capability in Lucene where someone can configure that they just want flat vectors and no HNSW graphs. But I'm not sure how well that will fly in the Lucene world. I will think more about whether there are other sophisticated conditions with which we can do all this in OpenSearch only.
Another option could be creating another codec for the training index (index.knn: false) which does not create a graph but still uses the Lucene vector format.
@heemin32 the problem is that our codec classes will not get hit if index.knn: false. So I am trying to see what we can do here.
I am able to come up with a way to go around this problem, I provided the solution here: https://github.com/opensearch-project/k-NN/issues/1079#issuecomment-2195423112
Addressing in https://github.com/opensearch-project/k-NN/issues/1853
Currently, we integrate our native libraries with OpenSearch through Lucene's DocValuesFormat. At the time, Lucene did not have KnnVectorsFormat (which was released in 9.0).
Now that it exists, I am wondering if we should move to use KnnVectorsFormat. KnnVectorsFormat has a KnnVectorsWriter and KnnVectorsReader. Migrating to KnnVectorsFormat would allow us to:
In general, it would make the native library integrations more in line with the Lucene architecture, which would have long-term benefits around maintainability and extensibility.
All that being said, we need to do a deep dive into what switching means in terms of backwards compatibility and also scope out how much work needs to be done.
Tracking list of benefits of moving to KnnVectorsFormat
#1467
#1506
#1050
#675