opensearch-project / k-NN

đŸ†• Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
152 stars 113 forks source link

Fix the force merge with Quantization failures when a segment has deleted docs in it #2046

Closed navneet1v closed 3 weeks ago

navneet1v commented 3 weeks ago

Description

Fix the force merge with Quantization failures when a segment has deleted docs in it.

Issue:

When documents are deleted from the segment and segments are force merged, the FlatVectorValues are getting merged to 1 vectorValues but that vectorvalue.getLiveDocs() doesn't provide correct live docs but it provides the cost which includes the deleted docs too.

Due to this when we are doing quantization, we are hitting the vectors which are not present and leading to NPE.

org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
        at org.opensearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2494) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:982) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
        at org.apache.lucene.index.IndexWriter.hasPendingMerges(IndexWriter.java:2436) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.index.engine.InternalEngine$EngineMergeScheduler.afterMerge(InternalEngine.java:2452) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.OpenSearchConcurrentMergeScheduler.doMerge(OpenSearchConcurrentMergeScheduler.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
Caused by: java.lang.NullPointerException: Cannot read field "values" because "this.current" is null
        at org.apache.lucene.codecs.KnnVectorsWriter$MergedVectorValues$MergedFloat32VectorValues.vectorValue(KnnVectorsWriter.java:225) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.knn.index.vectorvalues.VectorValueExtractorStrategy$DISIVectorExtractor.extract(VectorValueExtractorStrategy.java:78) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.VectorValueExtractorStrategy.extractFloatVector(VectorValueExtractorStrategy.java:32) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.KNNFloatVectorValues.getVector(KNNFloatVectorValues.java:26) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.KNNFloatVectorValues.getVector(KNNFloatVectorValues.java:19) ~[?:?]
        at org.opensearch.knn.index.quantizationservice.KNNVectorQuantizationTrainingRequest.getVectorAtThePosition(KNNVectorQuantizationTrainingRequest.java:53) ~[?:?]
        at org.opensearch.knn.quantization.quantizer.QuantizerHelper.calculateMeanThresholds(QuantizerHelper.java:35) ~[?:?]
        at org.opensearch.knn.quantization.quantizer.OneBitScalarQuantizer.train(OneBitScalarQuantizer.java:63) ~[?:?]
        at org.opensearch.knn.index.quantizationservice.QuantizationService.train(QuantizationService.java:69) ~[?:?]
        at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.trainAndIndex(NativeEngines990KnnVectorsWriter.java:262) ~[?:?]
        at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.mergeOneField(NativeEngines990KnnVectorsWriter.java:104) ~[?:?]
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField(PerFieldKnnVectorsFormat.java:121) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.codecs.KnnVectorsWriter.merge(KnnVectorsWriter.java:99) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.mergeVectorValues(SegmentMerger.java:282) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:325) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:171) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:660) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.index.engine.OpenSearchConcurrentMergeScheduler.doMerge(OpenSearchConcurrentMergeScheduler.java:120) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
[2024-09-05T02:44:02,224][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-0] [my-knn-index-1][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [my-knn-index-1][0]: Recovery failed on {integTest-0}{FZH6cEHdT167xfsFW4y9Gg}{EmKDYp9tSyaQcRhQxPwCUg}{127.0.0.1}{127.0.0.1:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}
        at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3878) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:626) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:123) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2895) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:982) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
        ... 11 more
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
        at org.opensearch.index.translog.Translog.ensureOpen(Translog.java:1882) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.translog.Translog.getMaxSeqNo(Translog.java:1959) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.translog.InternalTranslogManager.getMaxSeqNo(InternalTranslogManager.java:398) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:337) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:223) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:43) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2605) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2511) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2483) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:750) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        ... 8 more

With this change, we are ensuring that we are iterating over the vector values to find the correct live docs and then pass it to Quantization and NativeIndexWriter.

Testing

I added an IT that simulate this buggy nature, but due to some race conditions it doesn't work 100% of the time. I will follow up on fixing the IT and make it more strict.

Manual Testing

Currently I tested manually by below flow, which always create exceptions without this fix.

curl --request PUT \
  --url http://localhost:9200/my-knn-index-1 \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "refresh_interval": "1s",
            "knn": true
        }
    },
    "mappings": {
        "properties": {
            "my_vector2": {
                "type": "knn_vector",
                "dimension": 4,
                "space_type": "innerproduct"
            },
            "my_vector3": {
                "type": "knn_vector",
                "dimension": 4,
                "space_type": "innerproduct",
                "mode": "on_disk",
                "compression_level": "32x"
            }
        }
    }
}'

Index Data

curl --request POST \
  --url 'http://localhost:9200/_bulk?refresh=' \
  --header 'Content-Type: application/json' \
  --data '{ "index": { "_index": "my-knn-index-1", "_id": "111" } }
{ "my_vector2": [-1.5, -5.5, -4.5, -6.4], "my_vector3": [1.5, 5.5, 4.5, 6.4], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "9111" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "my_vector3": [-1.5, -5.5, -4.5, -6.4], "price": 8.9, "message": "OpenSearch" }
'

Index data again to simulate deletes

curl --request POST \
  --url 'http://localhost:9200/_bulk?refresh=' \
  --header 'Content-Type: application/json' \
  --data '{ "index": { "_index": "my-knn-index-1", "_id": "111" } }
{ "my_vector2": [-1.5, -5.5, -4.5, -6.4], "my_vector3": [1.5, 5.5, 4.5, 6.4], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "9111" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "my_vector3": [-1.5, -5.5, -4.5, -6.4], "price": 8.9, "message": "OpenSearch" }
'

Do force merge

curl --request POST \
  --url 'http://localhost:9200/my-knn-index-1/_forcemerge?max_num_segments=1&wait_for_completion=true' \
  --header 'Content-Type: application/json'

check logs to see the errors.

Do Search it fails without this fix

curl --request POST \
  --url http://localhost:9200/my-knn-index-1/_search \
  --header 'Content-Type: application/json' \
  --data '{
    "query":{
        "knn":{
            "my_vector2": {
                "vector": [1,1,1,1],
                "k": 10
            }
        }
    }
}'

Issue

https://github.com/opensearch-project/k-NN/issues/1949

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

navneet1v commented 3 weeks ago

Checking why CIs are failing

navneet1v commented 3 weeks ago

Checking why CIs are failing

CIs were failing because the code was skipping some docs during live docs counting. Now the logic is fixed. Thanks @jmazanec15 for help in debugging.

Vikasht34 commented 3 weeks ago

LGTM

navneet1v commented 3 weeks ago

waiting on CIs to complete.