opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0
61 stars 64 forks

[BUG] Neural search: ArrayIndexOutOfBoundsException: Index 495884 out of bounds for length 1 #666

Open lihuimingxs opened 6 months ago

lihuimingxs commented 6 months ago

Describe the bug

In OpenSearch 2.12.0:

When concurrent neural search requests are sent while a GPU is in use, an ArrayIndexOutOfBoundsException is thrown.

I'm not sure whether this is a concurrency issue, but a single request succeeds, and the exception occurs only when requests are sent concurrently.

Number of concurrent requests: more than 5

Request:

GET irp_index_vec/_search
{
      "size": 100, 
      "query": {
          "bool": {
              "filter": [
                  {
                      "bool": {
                          "must": [
                              {
                                  "terms": {
                                      "stat": [
                                          1
                                      ]
                                  }
                              }
                          ]
                      }
                  }
              ],
              "must": [
                  {
                      "neural": {
                          "embeddingCnVector1": {
                              "query_text": "some content",
                              "k": 100
                          }
                      }
                  }
              ]
          }
      }
}

Exception:

[2024-03-28T19:30:27,355][WARN ][r.suppressed             ] [opensearch-cluster_manager] path: /irp_index_vec/_search, params: {typed_keys=true, index=irp_index_vec}
org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:722) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:379) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:298) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:138) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.CountedCollector.countDown(CountedCollector.java:66) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.CountedCollector.onFailure(CountedCollector.java:85) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:257) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:766) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.TransportService$9.handleException(TransportService.java:1725) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:404) [opensearch-security-2.12.0.0.jar:2.12.0.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1511) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$5(InboundHandler.java:447) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.handleException(InboundHandler.java:445) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:437) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:170) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.12.0.jar:2.12.0]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) [netty-codec-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) [netty-codec-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.106.Final.jar:4.1.106.Final]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.OpenSearchException$3: Index 495884 out of bounds for length 1
        at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:708) ~[opensearch-core-2.12.0.jar:2.12.0]
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:377) [opensearch-2.12.0.jar:2.12.0]
        ... 49 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 495884 out of bounds for length 1
        at org.apache.lucene.util.SparseFixedBitSet.get(SparseFixedBitSet.java:129) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
        at org.opensearch.search.fetch.FetchPhase.findRootDocumentIfNested(FetchPhase.java:283) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.search.fetch.FetchPhase.prepareHitContext(FetchPhase.java:299) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.search.fetch.FetchPhase.execute(FetchPhase.java:172) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.search.SearchService.lambda$executeFetchPhase$3(SearchService.java:782) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) ~[opensearch-2.12.0.jar:2.12.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.12.0.jar:2.12.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        ... 1 more

Related component

Search

To Reproduce

  1. Send more than 5 neural search requests concurrently
  2. View the cluster logs
  3. See the error in the logs
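
The steps above can be sketched as a small load script. This is only a rough illustration, not the reporter's actual client: the URL, the request count of 6, and the helper names `send_search`, `run_concurrently`, and `count_failures` are assumptions. The reported cluster runs the security plugin over TLS, so authentication and certificate handling would have to be added for a real run.

```python
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint for illustration; the report does not give the host or port.
SEARCH_URL = "http://localhost:9200/irp_index_vec/_search"

# Query body copied from the report.
QUERY = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [{"bool": {"must": [{"terms": {"stat": [1]}}]}}],
            "must": [
                {"neural": {"embeddingCnVector1": {"query_text": "some content", "k": 100}}}
            ],
        }
    },
}


def send_search(url=SEARCH_URL, body=QUERY):
    """POST the search body once and return (HTTP status, response text)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Error responses still carry a body that is worth inspecting.
        return err.code, err.read().decode("utf-8")


def run_concurrently(send, n_requests=6):
    """Call `send` n_requests times in parallel threads and collect the results."""
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        futures = [pool.submit(send) for _ in range(n_requests)]
        return [future.result() for future in futures]


def count_failures(results):
    """Count results whose HTTP status is not 200."""
    return sum(1 for status, _ in results if status != 200)
```

Calling `count_failures(run_concurrently(send_search, 6))` against the affected cluster and comparing it with `run_concurrently(send_search, 1)` would show whether failures appear only under concurrency.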

Expected behavior

Neural search works correctly when a GPU is used.

peternied commented 6 months ago

[Triage - attendees 1 2 3 4 5 6 7 8] @opensearch-project/admin could you please transfer this to the neural search repository?

navneet1v commented 5 months ago

@lihuimingxs can you try removing the neural clause from the query and running it with 5 concurrent requests? The stack trace comes from the fetch phase, so this could be an issue in OpenSearch core, because the neural query doesn't touch the fetch phase of search.
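
A minimal sketch of the control query suggested here, assuming the request body from the report and that `bool.must` is a list as it is there. The helper name `without_neural` is hypothetical. It returns a copy of the body with any `neural` clause dropped from `bool.must`, so only the `stat` filter remains.

```python
import copy


def without_neural(search_body):
    """Return a copy of the search body with `neural` clauses removed from bool.must."""
    body = copy.deepcopy(search_body)  # leave the caller's dict untouched
    bool_query = body.get("query", {}).get("bool", {})
    must = bool_query.get("must", [])
    kept = [clause for clause in must if "neural" not in clause]
    if kept:
        bool_query["must"] = kept
    else:
        # No clauses left: drop the key so the filter alone drives the query.
        bool_query.pop("must", None)
    return body


# Request body copied from the report.
original = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [{"bool": {"must": [{"terms": {"stat": [1]}}]}}],
            "must": [
                {"neural": {"embeddingCnVector1": {"query_text": "some content", "k": 100}}}
            ],
        }
    },
}

control = without_neural(original)
```

If the exception still appears when `control` is sent concurrently, the neural clause is unlikely to be the trigger, which would support the suggestion that the problem lies in the fetch phase.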

jmazanec15 commented 1 month ago

@lihuimingxs are you still facing the issue?