opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] NPE while calling ANN search when deleted docs or a segment with no vector field present in the index #2277

Closed navneet1v closed 1 day ago

navneet1v commented 5 days ago

Description

When a k-NN index (in on_disk mode) contains deleted documents, search fails with a NullPointerException (NPE).

Impacted cases (see the sections below for workarounds):

  1. The issue occurs only when the index uses on_disk mode with rescoring.
  2. An on_disk mode index has deleted docs and the segment is not committed.
  3. An on_disk mode index has some segments that do not contain the vector field being queried.
  4. An on_disk mode index has shards that do not contain the vector field being queried.

Please refer to the steps below for reproduction.

Steps to Reproduce with deleted docs

Create Index

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 8,
        "mode": "on_disk"
      }
    }
  }
}

Ingest 2 documents

PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"my_vector":[1,1,1,1,1,1,1,1]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"my_vector":[1,1,1,1,1,1,1,1]}

Search working as expected

GET my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1
        ],
        "k": 3
      }
    }
  }
}

Delete a document

DELETE my-knn-index-1/_doc/1

Search again (fails with error)

GET my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [
          1,
          1,
          1,
          1,
          1,
          1,
          1,
          1
        ],
        "k": 3
      }
    }
  }
}

Error Response

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"org.apache.lucene.index.FieldInfo.getAttribute(String)\" because \"fieldInfo\" is null"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my-knn-index-1",
        "node": "l4iYtRCNSDOBxppXQULzig",
        "reason": {
          "type": "null_pointer_exception",
          "reason": "Cannot invoke \"org.apache.lucene.index.FieldInfo.getAttribute(String)\" because \"fieldInfo\" is null"
        }
      }
    ],
    "caused_by": {
      "type": "null_pointer_exception",
      "reason": "Cannot invoke \"org.apache.lucene.index.FieldInfo.getAttribute(String)\" because \"fieldInfo\" is null",
      "caused_by": {
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"org.apache.lucene.index.FieldInfo.getAttribute(String)\" because \"fieldInfo\" is null"
      }
    }
  },
  "status": 500
}

Stack Trace

opensearch-node1       | org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:775) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:395) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:815) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:548) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:316) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:104) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:766) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.TransportService$9.handleException(TransportService.java:1741) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:420) [opensearch-security-2.18.0.0.jar:2.18.0.0]
opensearch-node1       |    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1527) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:454) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.NativeMessageHandler.handleException(NativeMessageHandler.java:452) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.NativeMessageHandler.handlerResponseError(NativeMessageHandler.java:444) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.NativeMessageHandler.handleMessage(NativeMessageHandler.java:172) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.NativeMessageHandler.messageReceived(NativeMessageHandler.java:126) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundHandler.messageReceivedFromPipeline(InboundHandler.java:120) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:112) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:796) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundBytesHandler.forwardFragments(InboundBytesHandler.java:137) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundBytesHandler.doHandleBytes(InboundBytesHandler.java:77) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:124) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:113) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.18.0.jar:2.18.0]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1503) [netty-handler-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1366) [netty-handler-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1415) [netty-handler-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) [netty-codec-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) [netty-codec-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.114.Final.jar:4.1.114.Final]
opensearch-node1       |    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-node1       | Caused by: org.opensearch.OpenSearchException$3: Cannot invoke "org.apache.lucene.index.FieldInfo.getAttribute(String)" because "fieldInfo" is null
opensearch-node1       |    at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:710) ~[opensearch-core-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:393) [opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    ... 51 more
opensearch-node1       | Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.lucene.index.FieldInfo.getAttribute(String)" because "fieldInfo" is null
opensearch-node1       |    at org.opensearch.knn.common.FieldInfoExtractor.getSpaceType(FieldInfoExtractor.java:89) ~[?:?]
opensearch-node1       |    at org.opensearch.knn.index.query.ExactSearcher.getKNNIterator(ExactSearcher.java:153) ~[?:?]
opensearch-node1       |    at org.opensearch.knn.index.query.ExactSearcher.searchLeaf(ExactSearcher.java:62) ~[?:?]
opensearch-node1       |    at org.opensearch.knn.index.query.KNNWeight.exactSearch(KNNWeight.java:388) ~[?:?]
opensearch-node1       |    at org.opensearch.knn.index.query.nativelib.NativeEngineKnnVectorQuery.lambda$doRescore$1(NativeEngineKnnVectorQuery.java:124) ~[?:?]
opensearch-node1       |    at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
opensearch-node1       |    at org.apache.lucene.search.TaskExecutor$TaskGroup$1.run(TaskExecutor.java:120) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
opensearch-node1       |    at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:176) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
opensearch-node1       |    at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:84) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
opensearch-node1       |    at org.opensearch.knn.index.query.nativelib.NativeEngineKnnVectorQuery.doRescore(NativeEngineKnnVectorQuery.java:127) ~[?:?]
opensearch-node1       |    at org.opensearch.knn.index.query.nativelib.NativeEngineKnnVectorQuery.createWeight(NativeEngineKnnVectorQuery.java:73) ~[?:?]
opensearch-node1       |    at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:899) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
opensearch-node1       |    at org.opensearch.search.internal.ContextIndexSearcher.createWeight(ContextIndexSearcher.java:226) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:560) ~[lucene-core-9.12.0.jar:9.12.0 e913796758de3d9b9440669384b29bec07e6a5cd - 2024-09-25 16:37:02]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:355) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:462) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:450) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWith(QueryPhase.java:432) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.query.QueryPhaseSearcherWrapper.searchWith(QueryPhaseSearcherWrapper.java:60) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.neuralsearch.search.query.HybridQueryPhaseSearcher.searchWith(HybridQueryPhaseSearcher.java:61) ~[?:?]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:282) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:155) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:646) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:710) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:679) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.18.0.jar:2.18.0]
opensearch-node1       |    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
opensearch-node1       |    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
opensearch-node1       |    at java.lang.Thread.run(Thread.java:1583) ~[?:?]

Impacted Versions

2.17 and 2.18

Root cause

Based on my deep dive, when documents are deleted in OpenSearch, OpenSearch marks the document as deleted in the main segment but also creates a segment that contains only the deleted docs. See the segments response below:

GET _cat/segments/my-knn-index-1?format=json

[
  {
    "index": "my-knn-index-1",
    "shard": "0",
    "prirep": "p",
    "ip": "172.18.0.3",
    "segment": "_0",
    "generation": "0",
    "docs.count": "1",
    "docs.deleted": "1",
    "size": "5.1kb",
    "size.memory": "0",
    "committed": "false",
    "searchable": "true",
    "version": "9.12.0",
    "compound": "true"
  },
  {
    "index": "my-knn-index-1",
    "shard": "0",
    "prirep": "p",
    "ip": "172.18.0.3",
    "segment": "_1",
    "generation": "1",
    "docs.count": "0",
    "docs.deleted": "1",
    "size": "3.1kb",
    "size.memory": "0",
    "committed": "false",
    "searchable": "true",
    "version": "9.12.0",
    "compound": "true"
  }
]

Because the second segment (_1) contains no live documents, it has no field info either. As a result, when the rescore phase runs in disk-based vector search, we cannot get the fieldInfo for the field, because the field is not present in that segment.

The bug is not limited to deleted-docs segments. It also happens if a segment doesn't contain the vector field, because in that case too the field info is null for that segment. I validated this by ingesting docs so that one segment has the vector field and another does not:

PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"my_vector":[1,1,1,1,1,1,1,1]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"my_vector":[1,1,1,1,1,1,1,1]}

PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"my": "aaa"}

Line that returns the null field info: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/ExactSearcher.java#L152

The NPE is thrown from this line: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/common/FieldInfoExtractor.java#L88
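
The failure mode described above can be modelled outside the plugin. The sketch below is plain Python, not the plugin's Java code. `Segment`, `field_info`, `get_space_type_unsafe`, `exact_search`, and the `space_type` attribute key are all hypothetical names chosen for illustration, and the guard shown is only one plausible way to avoid the failure, not necessarily what the actual fix does. Each segment is represented as a mapping from field name to attributes, so a segment that never saw `my_vector` simply has no entry for it.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a Lucene segment: field name -> field attributes.
# A segment holding only deleted docs, or only docs without the vector field,
# has no entry for that field.
Segment = Dict[str, Dict[str, str]]


def field_info(segment: Segment, field: str) -> Optional[Dict[str, str]]:
    """Return the field's attributes, or None if the segment lacks the field."""
    return segment.get(field)


def get_space_type_unsafe(segment: Segment, field: str) -> str:
    # Mirrors the failing pattern: the lookup result is used without a None check.
    info = field_info(segment, field)
    return info["space_type"]  # raises TypeError when info is None


def exact_search(segments: List[Segment], field: str) -> List[str]:
    """Collect the space type per segment, skipping segments without the field."""
    results = []
    for segment in segments:
        info = field_info(segment, field)
        if info is None:
            # Nothing to rescore in this segment; skip it instead of failing.
            continue
        results.append(info["space_type"])
    return results


segments = [
    {"my_vector": {"space_type": "l2"}},  # segment with vectors (illustrative value)
    {},                                   # segment with no field info
]
```

In this model, `get_space_type_unsafe(segments[1], "my_vector")` fails in the same spirit as the NPE in the stack trace, while `exact_search(segments, "my_vector")` only returns results for the segment that actually has the field.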

Thanks @heemin32 for reporting the bug related to deleted docs.

Workaround

  1. If the index only has segments with deleted docs, flushing the index may solve the problem: POST <index-name>/_flush
  2. For segments with no vector field, a force merge to 1 segment is required to ensure that no segment is left without the vector field.
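
As a rough aid for choosing between the two workarounds, the `_cat/segments` output shown earlier can be checked for segments that hold no live documents. This is a sketch under assumptions: the function names are hypothetical, the check relies only on the `docs.count` and `docs.deleted` fields that appear in that response, and it cannot detect a segment whose live docs merely lack the vector field. The force merge path assumes the standard `_forcemerge` API with `max_num_segments=1`. Nothing is sent to a cluster here.

```python
import json
from typing import Dict, List, Tuple


def deleted_only_segments(cat_segments_json: str) -> List[str]:
    """Names of segments with zero live docs and at least one deleted doc."""
    rows: List[Dict[str, str]] = json.loads(cat_segments_json)
    return [
        row["segment"]
        for row in rows
        if int(row["docs.count"]) == 0 and int(row["docs.deleted"]) > 0
    ]


def workaround_requests(index: str) -> List[Tuple[str, str]]:
    """(method, path) pairs for the two workarounds; nothing is sent here."""
    return [
        ("POST", f"/{index}/_flush"),
        ("POST", f"/{index}/_forcemerge?max_num_segments=1"),
    ]


# Trimmed copy of the _cat/segments response from this issue.
sample = json.dumps([
    {"segment": "_0", "docs.count": "1", "docs.deleted": "1"},
    {"segment": "_1", "docs.count": "0", "docs.deleted": "1"},
])
print(deleted_only_segments(sample))
print(workaround_requests("my-knn-index-1"))
```

If the first function reports any segment, the flush in workaround 1 is the cheaper thing to try. An empty result does not rule out the no-vector-field case, which still needs the force merge.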

Cases with no workaround

If docs with and without vectors are divided among different shards and one shard has no doc with the vector field, there is no workaround. In that shard the field info will never be present, so the exception will keep occurring.

navneet1v commented 4 days ago

Created a PR for the fix: #2278

navneet1v commented 1 day ago

Bug fixed in PR: https://github.com/opensearch-project/k-NN/pull/2281 (closing this issue)