Closed jmazanec15 closed 1 month ago
When we added KNN support using native engine libraries (Faiss, NMSLIB), there was no KnnVectorsFormat available in the Lucene library. So we used BinaryDocValuesFormat to store the vector data and built the native engine index while indexing the doc values. For searching, we implemented our own query that utilizes the native engine index for ANN and the doc value data for exact search.
Now that KnnVectorsFormat is available in the Lucene library, we are exploring an opportunity to move from DocValuesFormat to KnnVectorsFormat to see if there are any benefits to doing so.
With KnnVectorsFormat, we cannot simply store the vector data and add a native engine index alongside it as we did with DocValuesFormat, because that would also create a Lucene vector index that serves no purpose and only consumes resources. Therefore, we need to override the entire implementation by extending the KnnVectorsFormat class for both indexing and querying.
After comparing the pros and cons of migration, we want to hold off on it until we have more data points to support migration, given the effort it requires.
A few points to note:
Indexing: When we use the Lucene engine, we store the vector data using either KnnByteVectorField or KnnFloatVectorField. If doc values are enabled, we store the vector data in binary doc value format as well. Then, in the codec, we delegate both the binary doc values and the knn vector format to the Lucene library to handle. For native libraries, we always store the vector data in binary doc value format. Then, in the codec, we delegate the binary doc value processing to the Lucene library. After that, if the field is a knn field, we run an additional process to index the values using the native library (Faiss or NMSLIB). The native engine also uses doc values during merge.
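The native-engine flow above (delegate storage to Lucene, then build the native index for knn fields) can be pictured with a minimal sketch. All types here are stand-ins for illustration, not the actual Lucene or k-NN plugin classes, and the knn-field predicate is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;

public class NativeDelegationSketch {
    // Stand-in for Lucene's doc values consumer API.
    interface DocValuesWriter {
        void addBinaryField(String field, byte[] value);
    }

    /** Records which fields were written; stands in for Lucene's real writer. */
    static class LuceneDocValuesWriter implements DocValuesWriter {
        final List<String> written = new ArrayList<>();
        public void addBinaryField(String field, byte[] value) {
            written.add(field);
        }
    }

    /** Delegates storage to Lucene first, then triggers a native (Faiss/NMSLIB) index build for knn fields. */
    static class NativeEngineDocValuesWriter implements DocValuesWriter {
        final DocValuesWriter delegate;
        final List<String> nativeIndexedFields = new ArrayList<>();

        NativeEngineDocValuesWriter(DocValuesWriter delegate) {
            this.delegate = delegate;
        }

        public void addBinaryField(String field, byte[] value) {
            delegate.addBinaryField(field, value);  // Lucene handles on-disk storage
            if (isKnnField(field)) {
                nativeIndexedFields.add(field);     // stand-in for the JNI index build step
            }
        }

        static boolean isKnnField(String field) {
            return field.endsWith("_vector");       // placeholder predicate, not the real mapping check
        }
    }
}
```

The point of the decorator shape is that Lucene remains the source of truth for storage; the native index build is strictly additive.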
Searching: For the Lucene engine, we use the existing KnnByteVectorQuery and KnnFloatVectorQuery. Both of them extend the AbstractKnnVectorQuery class, which has common methods to support vector queries. By returning the query class, the rest of the process is handled by OpenSearch. The Lucene engine switches to exact search based on internal logic, and it does not need doc values to be enabled to do the exact search.
For the native engine, we have our own KNNQuery class. KNNQuery has its own logic to switch to exact search. For exact search to work, doc values are needed; ANN search works without doc values. Native engine libraries support vector data access by id, but we need to test how efficient that is compared to doc values.
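The native engine's switch between ANN and exact search described above can be sketched as a simple decision rule. This is a simplified model, not the actual KNNQuery logic; the threshold and parameter names are hypothetical:

```java
public class SearchModeSketch {
    /**
     * Simplified decision: fall back to exact search when a filter leaves fewer
     * candidates than k, or when no native index segment is available. The exact
     * path requires doc values today, since it reads vectors from BinaryDocValues.
     */
    static boolean useExactSearch(int filteredDocCount, int k,
                                  boolean nativeIndexPresent, boolean docValuesEnabled) {
        if (!docValuesEnabled) {
            return false;  // exact path has no vector source without doc values
        }
        return !nativeIndexPresent || filteredDocCount <= k;
    }
}
```

This also illustrates the coupling the issue calls out: disabling doc values removes the exact-search fallback entirely for the native engine, while ANN keeps working.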
Compound file handling: There is KNN80CompoundFormat. Its role is to move the files generated by the native engine from the original compound file to a new compound file with a different extension, to avoid header/footer checks. This could be removed entirely if we migrate to a new format with our own codec with a custom reader and writer.
Scoring script: The scoring script runs on doc values for both the Lucene engine and the native engine. This part of the code will remain the same.
Others: KNN stats and model training will remain the same.
We need to create a FieldType using the setVectorAttributes method of FieldType, similar to KnnFloatVectorField and KnnByteVectorField, so that the type is treated as a vector type. To create doc values, the code can be shared between KNNVectorFieldMapper and LuceneFieldMapper:
FieldType type = new FieldType();
type.setVectorAttributes(dimension, VectorEncoding.BYTE, similarityFunction);
// Add more attributes
type.freeze();
(Before/after diagrams)
We might either extend AbstractKnnVectorQuery or just use KnnFloatVectorQuery for the native engine. The following code is from KnnFloatVectorQuery. If we implement our reader properly, this query could be used as-is.
protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException {
    TopDocs results = context.reader().searchNearestVectors(this.field, this.target, this.k, acceptDocs, visitedLimit);
    return results != null ? results : NO_RESULTS;
}

VectorScorer createVectorScorer(LeafReaderContext context, FieldInfo fi) throws IOException {
    return fi.getVectorEncoding() != VectorEncoding.FLOAT32 ? null : VectorScorer.create(context, fi, this.target);
}
However, if we want to migrate the exact search implementation later, we can extend AbstractKnnVectorQuery and override its methods. The createVectorScorer method is used only inside exactSearch; therefore, if we override exactSearch we don't need to implement createVectorScorer.
protected abstract TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException;
protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator acceptIterator)
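The relationship between these two hooks can be sketched with stand-in types. This is a simplified model of AbstractKnnVectorQuery's template method, not its real code: the base class runs the approximate pass and falls back to the exact pass when it is inconclusive, so a subclass that overrides exactSearch never touches createVectorScorer:

```java
public class QueryTemplateSketch {
    /** Stand-in for AbstractKnnVectorQuery's per-leaf template method. */
    abstract static class AbstractVectorQuery {
        final int[] search() {
            int[] approx = approximateSearch();
            // null models "approximate pass inconclusive": fall back to exact search
            return approx != null ? approx : exactSearch();
        }
        abstract int[] approximateSearch();
        abstract int[] exactSearch();
    }

    /** Hypothetical native-engine subclass: ANN via JNI, exact via doc values (both faked here). */
    static class NativeVectorQuery extends AbstractVectorQuery {
        final boolean nativeIndexAvailable;

        NativeVectorQuery(boolean nativeIndexAvailable) {
            this.nativeIndexAvailable = nativeIndexAvailable;
        }

        int[] approximateSearch() {
            // returning null triggers the base class's exact-search fallback
            return nativeIndexAvailable ? new int[] {3, 1, 4} : null;
        }

        int[] exactSearch() {
            return new int[] {2, 7};  // brute-force stand-in
        }
    }
}
```

Because the fallback wiring lives in the base class, overriding both hooks is the whole integration surface for the native engine.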
(Before/after diagrams)
We need to define our own NativeEngineVectorsFormat extending KnnVectorsFormat. In BasePerFieldKnnVectorsFormat, we return the appropriate vector format based on engine type: NativeEngineVectorsFormat or Lucene95HnswVectorsFormat. Here, we also need to implement a KnnVectorsWriter and KnnVectorsReader for the native engine.
KnnVectorsReader
public abstract void checkIntegrity() throws IOException;
public abstract FloatVectorValues getFloatVectorValues(String field) throws IOException;
public abstract ByteVectorValues getByteVectorValues(String field) throws IOException;
public abstract TopDocs search(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;
public abstract TopDocs search(String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;
KnnVectorsWriter
public abstract KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException;
public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;
public abstract void finish() throws IOException;
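The per-field dispatch described above can be sketched as follows. This is a simplified stand-in for BasePerFieldKnnVectorsFormat (which returns KnnVectorsFormat instances, not names), and NativeEngineVectorsFormat is the hypothetical format proposed in this issue:

```java
public class PerFieldFormatSketch {
    enum Engine { FAISS, NMSLIB, LUCENE }

    /**
     * Stand-in for the per-field format lookup: native engines get the
     * proposed NativeEngineVectorsFormat, while the Lucene engine keeps
     * Lucene's own HNSW format.
     */
    static String formatFor(Engine engine) {
        switch (engine) {
            case FAISS:
            case NMSLIB:
                return "NativeEngineVectorsFormat";
            default:
                return "Lucene95HnswVectorsFormat";
        }
    }
}
```

The design point is that the codec no longer special-cases doc values for native engines; the engine type alone selects a KnnVectorsFormat, and the writer/reader pair behind it owns serialization.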
(Before/after diagrams)
Will close for now as investigation is complete. We may revisit this topic later.
I'm going to keep this issue open. I think we need to do this work at some point. A couple of issues where it could potentially be helpful:
Looking at the cons: (1) the same thing goes for doc values.
(2) and (3) shouldn't be listed. It's possible to just disable doc values for this field and use the vectors directly. Also, for (2), we could implement a wrapper around the native engine storage of vectors and not store by default.
(4) I don't think this will be the case (after our custom codec is deprecated).
(6) I think this may be true, but in the long run we will be able to reduce effort.
Additionally, our codec has not been updated in a while, and the versioning is not scalable. This might help make it more maintainable.
hi @jmazanec15 @navneet1v @heemin32, I have an idea about this topic. In some scenarios, we want to reduce the disk usage and IO throughput for the source field. So we would exclude knn fields in the mapping so they are not stored in _source (this means the knn fields cannot be retrieved or rebuilt):
"mappings": {
"_source": {
"excludes": [
"target_field1",
"target_field2",
]
}
}
So I propose using docvalue_fields for the vector fields, like:
POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
To support this, I rewrote KNNVectorDVLeafFieldData and made KNN80BinaryDocValues able to return the specific knn docvalue_fields, like (vector_field1 is a knn field type):
"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]
Optimization result: 1M SIFT dataset, 1 shard. With source stored: 1389 MB; without source stored: 1055 MB (-24%).
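As a quick sanity check on the reported saving, the percentage follows directly from the two sizes:

```java
public class SavingsCheck {
    /** Percent reduction from `before` to `after`, rounded to the nearest integer. */
    static long percentSaved(long before, long after) {
        return Math.round(100.0 * (before - after) / before);
    }
}
```

With before = 1389 MB and after = 1055 MB, this gives 24%, matching the figure above.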
Diving further into knn doc value fields: I think when using the faiss engine, we can use the reconstruct_n interface to retrieve the specific doc values and save the disk usage for BinaryDocValuesFormat. Or, as the comments in this issue suggest, redesign a KnnVectorsFormat.
I think I can create a new issue and PR for the code, and report the details.
@luyuncheng not storing vector fields in _source is always an option for users. But this comes with its own limitations:
POST some_index/_search { "docvalue_fields": [ "vector_field1", "vector_field2" ], "_source": false }
If we can support this kind of retrieval, it would be really awesome, because users who don't have use cases for re-indexing or update-by-query can still use it.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
If you remove the fields from _source then you cannot do reindexing and update-by-query. For more details you can refer to this doc: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/mapping-source-field.html#:~:text=Think%20before%20disabling%20the%20_source%20field
@navneet1v if we support synthetic source in the future, this would not be a problem.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
@navneet1v I will create a PR to show how I rewrote KNNVectorDVLeafFieldData and show some details. BTW, I am writing the tests for it.
If you remove the fields from _source then you cannot do reindexing and update-by-query.
@navneet1v if we support synthetic source in the future, this would not be a problem.
Yeah, that's right, but we don't. I am not particularly against removing the vector field from _source, as it's more index-mapping driven. It's just that we should be aware of the pitfalls.
When I tried running a similar query some time back, I got an error that the field doesn't support SortedBinaryDocValues. So I am wondering whether we just need to ensure that our binary doc values implement the sorted doc values interface.
@navneet1v I will create a PR to show how I rewrote KNNVectorDVLeafFieldData and show some details. BTW, I am writing the tests for it.
Can you also explore the sorted doc values? I would like to ensure that we are not over-engineering a solution.
@luyuncheng In terms of source storage, there is an issue over here: https://github.com/opensearch-project/OpenSearch/issues/6356
In terms of source storage, there is an issue over here: https://github.com/opensearch-project/OpenSearch/issues/6356
@jmazanec15 @navneet1v In #1571 I created a PR, which is WIP, and it shows how my idea works.
KNNVectorDVLeafFieldData can return the vector doc value fields the same way the script does: _source is replaced by the doc values fields response in the first step. This works somewhat like synthetic source, but the fields need to be explicitly requested in the search body via docvalue_fields.
@jmazanec15 , @heemin32 , @luyuncheng I created a POC for Moving to KNNFloatVectorValues: https://github.com/navneet1v/k-NN/commit/990a58f58df1519fa12306466c8f8a9f175ab13a
Things which are working:
Things not working:
Will work on fixing the things not working in the POC.
This code alone will not work. We need changes in OpenSearch and Lucene. The Lucene PR is raised.
Training index creation is not optimal.
What do you mean by this?
So right now, if someone is creating a training index, they can set index.knn: false to ensure graphs are not created for the index, since it is just a training index, and our custom codec is not picked.
But as of now, the way I implemented it, from the KNNFieldMapper standpoint I am migrating all indices greater than a specific version to use FloatVectorField. Now this FloatVectorField will by default create the k-NN graphs (this is on the Lucene side). Hence I was saying it is not optimal.
I was trying to add more complexity to the logic of when to pick LuceneVectorField vs. our KNNVectorField, but then the exact search use cases will not be optimal, because exact search will keep using BinaryDocValues rather than KNNFloatVectorsValues.
I am thinking we should add a capability in Lucene where someone can configure that they just want flat vectors and no HNSW graphs. But I'm not sure how well that will fly in the Lucene world. I will think more about whether there are other sophisticated conditions with which we can do all this in OpenSearch only.
Another option could be creating another codec for the training index (index.knn: false) which does not create a graph but still uses the Lucene vector format.
@heemin32 the problem is that our codec classes will not get hit if index.knn: false. So I am trying to see what we can do here.
I am able to come up with a way to go around this problem, I provided the solution here: https://github.com/opensearch-project/k-NN/issues/1079#issuecomment-2195423112
Addressing in https://github.com/opensearch-project/k-NN/issues/1853
Currently, we integrate our native libraries with OpenSearch through Lucene's DocValuesFormat. At the time, Lucene did not have KnnVectorsFormat (which was released in 9.0).
Now that it exists, I am wondering if we should move to use KnnVectorsFormat. KnnVectorsFormat has a KnnVectorsWriter and KnnVectorsReader. Migrating to KnnVectorsFormat would allow us to:
In general, it would make the native library integrations more in line with the Lucene architecture, which would have long-term benefits around maintainability and extensibility.
All that being said, we need to do a deep dive into what switching means in terms of backwards compatibility and also scope out how much work needs to be done.
Tracking list of benefits of moving to KnnVectorsFormat
#1467
#1506
#1050
#675