opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Add IT and BWC tests with Indices containing both Vector and Non Vector documents #2284

Open navneet1v opened 1 day ago


Description

Currently, all the ITs and BWC tests in the k-NN plugin create indices in which every document contains the vector field. In production, however, the documents of a k-NN index do not necessarily contain a vector field, or every vector field defined for the index. Because tests for this case are missing, we were not able to catch the following issues, which have since been fixed:

  1. NPE during disk-based vector search when a segment does not contain a vector field. Ref: https://github.com/opensearch-project/k-NN/issues/2277
  2. The feature that releases memory when an index is closed introduced a bug: if a segment has a knn_vector field but no docs with this field present, an index out-of-bounds exception is thrown. This was fixed in https://github.com/opensearch-project/k-NN/pull/2182.
    Caused by: NotSerializableExceptionWrapper[index_out_of_bounds_exception: Index 0 out of bounds for length 0]
    at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
    at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
    at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
    at java.util.Objects.checkIndex(Objects.java:385)
    at java.util.ArrayList.get(ArrayList.java:427)
    at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesProducer.<init>(KNN80DocValuesProducer.java:78)
    at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesFormat.fieldsProducer(KNN80DocValuesFormat.java:44)
    at org.apache.lucene.index.SegmentDocValues.newDocValuesProducer(SegmentDocValues.java:52)
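The trace above ends in `ArrayList.get` being called with index 0 on an empty list. The following is a minimal, generic sketch of that failure pattern and one way to guard against it. It is not the plugin's code and not the fix from the linked PR; `EmptySegmentGuard` and `firstEntry` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Generic illustration only: models "a segment with the field but no docs
// carrying it" as an empty list of per-field entries.
class EmptySegmentGuard {

    // Hypothetical helper: returns the first entry, or empty when there is none,
    // instead of calling get(0) unconditionally.
    static Optional<String> firstEntry(List<String> entries) {
        return entries.isEmpty() ? Optional.empty() : Optional.of(entries.get(0));
    }

    public static void main(String[] args) {
        List<String> entriesForSegment = new ArrayList<>();

        try {
            // Unguarded access: throws because the list is empty.
            entriesForSegment.get(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded access failed: " + e.getMessage());
        }

        // Guarded access: the empty case is handled explicitly.
        System.out.println("guarded access present: " + firstEntry(entriesForSegment).isPresent());
    }
}
```

A test index where one segment holds only documents without the vector field exercises exactly this empty case.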

Proposal

To catch the above issues during PRs, we should add tests (BWC and ITs) for all 3 engines and for disk-based vector search. For issue 1, I added an integration test along with the fix: https://github.com/opensearch-project/k-NN/blob/2d1a4080d5b1601bf3362fecd85384348af1f326/src/test/java/org/opensearch/knn/integ/ModeAndCompressionIT.java#L225 . We need to do something similar for BWC and for the other engines.

Tests to be added

  1. BWC test for all versions where an index has 10 docs, of which 9 contain the vector field and 1 does not. Ingestion should happen such that the document without the vector field gets its own segment. Ref. the Proposal section.
  2. Similar to the BWC test, we should have ITs that cover the following scenarios for an index created as in step 1:
    1. Vector search with Faiss
    2. Filter tests with Faiss
    3. Disk-based vector search with default compression
    4. Lucene engine tests
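The ingestion order described in item 1 can be sketched as a plain list of REST calls. This is a minimal sketch under stated assumptions, not the plugin's test code: the class `MixedVectorIndexPlan`, the index name `mixed-index`, the field name `my_vector`, and the use of an explicit `_flush` between batches to give the vector-less document its own segment are all illustrative choices. Background merges could still combine segments, so a real test may need to control merging as well. The code only builds the requests and does not contact a cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds the sequence of REST calls for the 9-with / 1-without scenario.
class MixedVectorIndexPlan {

    // One REST call: HTTP method, endpoint, and JSON body (empty when none).
    static final class Request {
        final String method;
        final String endpoint;
        final String body;

        Request(String method, String endpoint, String body) {
            this.method = method;
            this.endpoint = endpoint;
            this.body = body;
        }
    }

    // Index body with a single shard and one knn_vector field using HNSW on the given engine.
    static String mapping(String field, int dimension, String engine) {
        return "{\"settings\":{\"index\":{\"knn\":true,\"number_of_shards\":1}},"
            + "\"mappings\":{\"properties\":{\"" + field + "\":{"
            + "\"type\":\"knn_vector\",\"dimension\":" + dimension + ","
            + "\"method\":{\"name\":\"hnsw\",\"engine\":\"" + engine + "\"}}}}}";
    }

    // JSON array such as [3.0,3.0] with every component set to the same value.
    static String vector(int dimension, int value) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < dimension; i++) {
            if (i > 0) sb.append(',');
            sb.append(value).append(".0");
        }
        return sb.append(']').toString();
    }

    static List<Request> build(String index, String field, String engine, int dimension) {
        List<Request> plan = new ArrayList<>();
        plan.add(new Request("PUT", "/" + index, mapping(field, dimension, engine)));

        // Docs 1..9 carry the vector field.
        for (int id = 1; id <= 9; id++) {
            String body = "{\"" + field + "\":" + vector(dimension, id) + "}";
            plan.add(new Request("PUT", "/" + index + "/_doc/" + id, body));
        }
        // Flush so the 9 vector docs are committed before the next doc arrives.
        plan.add(new Request("POST", "/" + index + "/_flush", ""));

        // Doc 10 has no vector field and is flushed separately.
        plan.add(new Request("PUT", "/" + index + "/_doc/10", "{\"description\":\"no vector\"}"));
        plan.add(new Request("POST", "/" + index + "/_flush", ""));

        // k-NN query asking for 10 neighbours; only the 9 vector docs can match.
        String query = "{\"size\":10,\"query\":{\"knn\":{\"" + field + "\":{"
            + "\"vector\":" + vector(dimension, 1) + ",\"k\":10}}}}";
        plan.add(new Request("POST", "/" + index + "/_search", query));
        return plan;
    }

    public static void main(String[] args) {
        for (String engine : new String[] {"faiss", "lucene", "nmslib"}) {
            System.out.println("== engine: " + engine);
            for (Request r : build("mixed-index", "my_vector", engine, 2)) {
                System.out.println(r.method + " " + r.endpoint + " " + r.body);
            }
        }
    }
}
```

An IT or BWC test could replay such a plan through its REST client and then assert that the search succeeds and returns 9 hits. The filter and disk-based variants would change only the query body or the mapping.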

Please suggest more tests if there are any.