opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

Search fails with `java.lang.Exception: Check failed: it's either a bug or inconsistent data!` #322

Closed: tanguilp closed this issue 3 years ago

tanguilp commented 3 years ago

When searching a k-NN index, I get the following error:

[2021-02-11T08:45:46,295][WARN ][r.suppressed             ] [3b21158fbf5e] path: /20210210t155507z_catalog_5_vectormodel_86/_search, params: {index=20210210t155507z_catalog_5_vectormodel_86}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:568) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:324) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:603) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:400) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:70) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:258) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:640) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1290) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1264) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:61) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:56) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:51) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) [elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0.jar:7.10.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.ElasticsearchException$1: java.util.concurrent.ExecutionException: java.lang.Exception: Check failed: it's either a bug or inconsistent data!
    at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:322) [elasticsearch-7.10.0.jar:7.10.0]
    ... 22 more
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.Exception: Check failed: it's either a bug or inconsistent data!
    at com.amazon.opendistroforelasticsearch.knn.index.KNNIndexCache.getIndex(KNNIndexCache.java:157) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNWeight.scorer(KNNWeight.java:100) ~[?:?]
    at org.apache.lucene.search.Weight.bulkScorer(Weight.java:182) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:272) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:223) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:341) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:296) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:148) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:431) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.access$500(SearchService.java:141) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:401) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.0.jar:7.10.0]
    ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: Check failed: it's either a bug or inconsistent data!
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564) ~[?:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:525) ~[?:?]
    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:102) ~[?:?]
    at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:237) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2313) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2279) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045) ~[?:?]
    at com.google.common.cache.LocalCache.get(LocalCache.java:3951) ~[?:?]
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNIndexCache.getIndex(KNNIndexCache.java:154) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNWeight.scorer(KNNWeight.java:100) ~[?:?]
    at org.apache.lucene.search.Weight.bulkScorer(Weight.java:182) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:272) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:223) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:341) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:296) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:148) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:431) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.access$500(SearchService.java:141) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:401) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.0.jar:7.10.0]
    ... 6 more
Caused by: java.lang.Exception: Check failed: it's either a bug or inconsistent data!
    at com.amazon.opendistroforelasticsearch.knn.index.v206.KNNIndex.init(Native Method) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.v206.KNNIndex.loadIndex(KNNIndex.java:123) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNIndexCache.loadIndex(KNNIndexCache.java:307) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNIndexCache.lambda$getIndex$4(KNNIndexCache.java:154) ~[?:?]
    at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876) ~[?:?]
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155) ~[?:?]
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045) ~[?:?]
    at com.google.common.cache.LocalCache.get(LocalCache.java:3951) ~[?:?]
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNIndexCache.getIndex(KNNIndexCache.java:154) ~[?:?]
    at com.amazon.opendistroforelasticsearch.knn.index.KNNWeight.scorer(KNNWeight.java:100) ~[?:?]
    at org.apache.lucene.search.Weight.bulkScorer(Weight.java:182) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:272) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:223) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
    at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:341) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:296) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:148) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:431) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService.access$500(SearchService.java:141) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:401) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.0.jar:7.10.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.0.jar:7.10.0]
    ... 6 more
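
For context, the failing request is an ordinary approximate k-NN search. The exact request body isn't reproduced above; its shape is roughly the following, where the field name, vector values, and k are placeholders:

GET /20210210t155507z_catalog_5_vectormodel_86/_search
{
  "size": 10,
  "query": {
    "knn": {
      "vector_field": {
        "vector": [0.12, 0.34, 0.56],
        "k": 10
      }
    }
  }
}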

Searching some other, lighter indices does work.

The index seems to be correctly loaded:

GET matching_elastic:9200/_opendistro/_knn/stats -> 200 (78.758 ms)
%{
  "_nodes" => %{"failed" => 0, "successful" => 1, "total" => 1},
  "circuit_breaker_triggered" => false,
  "cluster_name" => "docker-cluster",
  "nodes" => %{
    "ljfmsm0zReyYcSgFIDtysg" => %{
      "cache_capacity_reached" => false,
      "eviction_count" => 0,
      "graph_index_errors" => 0,
      "graph_index_requests" => 88,
      "graph_memory_usage" => 5183801,
      "graph_memory_usage_percentage" => 84.751045,
      "graph_query_errors" => 0,
      "graph_query_requests" => 1953,
      "hit_count" => 1937,
      "indices_in_cache" => %{
        "20210210t155507z_catalog_5_vectormodel_86" => %{
          "graph_count" => 21,
          "graph_memory_usage" => 3222849,
          "graph_memory_usage_percentage" => 52.69103
        },
        "20210210t155507z_catalog_6_vectormodel_86" => %{
          "graph_count" => 19,
          "graph_memory_usage" => 1386285,
          "graph_memory_usage_percentage" => 22.664661
        },
        "20210210t155507z_catalog_7_vectormodel_86" => %{
          "graph_count" => 9,
          "graph_memory_usage" => 574667,
          "graph_memory_usage_percentage" => 9.39535
        }
      },
      "knn_query_requests" => 99,
      "load_exception_count" => 82,
      "load_success_count" => 49,
      "miss_count" => 131,
      "script_compilation_errors" => 0,
      "script_compilations" => 0,
      "script_query_errors" => 0,
      "script_query_requests" => 0,
      "total_load_time" => 15490000126
    }
  }
}

The index where search fails is the first one (and the heaviest one).
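
(For comparison, the relative sizes of the three indices can be checked with something like GET /_cat/indices/20210210t155507z_*?v&h=index,docs.count,store.size; the index name pattern here is just an illustration.)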

Feel free to ask me for more information if needed.

jmazanec15 commented 3 years ago

Thanks @tanguilp, I will take a look into this

jmazanec15 commented 3 years ago

@tanguilp how many docs are in this index?

Could you try re-indexing that index and attempting search again and let us know if the problem persists?
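
If it helps, here is a sketch of what I mean by re-indexing, using the standard _reindex API (the destination index name is only an example and would need to be created beforehand with the same knn_vector mapping as the source):

POST /_reindex
{
  "source": { "index": "20210210t155507z_catalog_5_vectormodel_86" },
  "dest": { "index": "20210210t155507z_catalog_5_vectormodel_86_reindexed" }
}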

We will work to reproduce this issue.

tanguilp commented 3 years ago

There are 7 million docs.

By the way, the service was killed by the OOM killer and restarted, and the problem disappeared. Closing for now; I'll reopen if this problem happens again and I can get better logs about it. Thanks!

jmazanec15 commented 3 years ago

Thanks @tanguilp