opensearch-project / k-NN

šŸ†• Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars 123 forks source link

Fix a flaky unit test:testMultiFieldsKnnIndex, which was failing due to inconsistent merge behaviors #1924

Closed navneet1v closed 3 months ago

navneet1v commented 3 months ago

Description

Fix a flaky unit test:testMultiFieldsKnnIndex, which was failing due to inconsistent merge behaviors

Validated by running a failing seed:

./gradlew ':test' --tests "org.opensearch.knn.index.codec.KNN990Codec.KNN990CodecTests.testMultiFieldsKnnIndex" -Dtests.seed=F72B6D8EB481B0BB

Old Logs:

The log shows that refresh and merges happening multiple times, resulting in 4 files, 2 from old segments and 2 files from new merged segment.

2> INFO: Using MemorySegmentIndexInput and native madvise support with Java 21 or later; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
  1> [2024-08-02T03:53:40,805][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Refresh operation complete in 60 ms
  1> [2024-08-02T03:53:40,926][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Refresh operation complete in 2 ms
  1> [2024-08-02T03:53:41,029][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Merge operation complete in 4 ms
  1> [2024-08-02T03:53:41,072][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Merge operation complete in 3 ms
  1> [2024-08-02T03:53:41,075][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Merge operation complete in 3 ms
  1> [_0.fdm, _0.fdt, _0.fdx, _0.fnm, _0.si, _0_2011_test_vector.hnsw, _0_Lucene90_0.dvd, _0_Lucene90_0.dvm, _2.fdm, _2.fdt, _2.fdx, _2.fnm, _2.si, _2_2011_my_vector.hnsw, _2_Lucene90_0.dvd, _2_Lucene90_0.dvm, _3.fdm, _3.fdt, _3.fdx, _3.fnm, _3.si, _3_2011_my_vector.hnsw, _3_2011_test_vector.hnsw, _3_Lucene90_0.dvd, _3_Lucene90_0.dvm, segments_3, write.lock]

Logs after fix

org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
  2> INFO: Using MemorySegmentIndexInput and native madvise support with Java 21 or later; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
  1> [2024-08-02T03:59:26,262][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Refresh operation complete in 78 ms
  1> [2024-08-02T03:59:26,304][WARN ][o.o.k.i.c.K.KNN80DocValuesConsumer] [testMultiFieldsKnnIndex] Refresh operation complete in 2 ms
  1> [2024-08-02T03:59:26,788][INFO ][o.o.k.i.c.K.KNN990CodecTests] [testMultiFieldsKnnIndex] after test
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

Related Issues

Resolves https://github.com/opensearch-project/k-NN/issues/1828

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jmazanec15 commented 3 months ago

@navneet1v How does this fix it?

navneet1v commented 3 months ago

@jmazanec15 so earlier the first Index Writer instance was using no merge policy but the other instance was using a merge policy. So, I ensured we have 1 IW instance.

Second, I simplified the logic for the test. IW on a JVM doesn't get closed till a JVM crashes or an index is marked as readOnly. Hence I fixed the test to be more simplified.

Third: whenever we close the IW, it does a flush and merge on flush is also a setting. But when the merge happened we didn't refresh the IR. Hence to avoid more complication on the IW and IR, I simplified the test logic.

The merge on flushes/IW close is not predictable, and better predictability is required in test to be consistent. Hence I make sure that flow is simple and predictable.

You can confirm the same using logs I provided.

I hope this ans the question.