quarkusio / search.quarkus.io

Search backend for Quarkus websites
Apache License 2.0

Indexing timeout on staging #290

Closed yrodiere closed 21 hours ago

yrodiere commented 5 days ago

We got this last night:

  * Indexing failed: Failed to index data: HSEARCH700101: 17 failure(s) occurred during mass indexing. See the logs for details. First failure on entity 'Guide#https://pt.quarkus.io/version/2.16/guides/security-openid-connect-dev-services': HSEARCH400588: Call to the bulk REST API failed: HSEARCH400007: Elasticsearch request failed: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
Request: POST /_bulk with parameters {}
Response: (no response)

    <details>
      <summary>Exception details: <code>java.lang.IllegalStateException</code></summary>

    ```java
    java.lang.IllegalStateException: Failed to index data: HSEARCH700101: 17 failure(s) occurred during mass indexing. See the logs for details. First failure on entity 'Guide#https://pt.quarkus.io/version/2.16/guides/security-openid-connect-dev-services': HSEARCH400588: Call to the bulk REST API failed: HSEARCH400007: Elasticsearch request failed: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
    Request: POST /_bulk with parameters {}
    Response: (no response)
        at io.quarkus.search.app.indexing.IndexingService.indexAll(IndexingService.java:237)
        at io.quarkus.search.app.indexing.IndexingService.reindex(IndexingService.java:172)
        at io.quarkus.search.app.indexing.IndexingService_Subclass.reindex$$superforward(Unknown Source)
        at io.quarkus.search.app.indexing.IndexingService_Subclass$$function$$1.apply(Unknown Source)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:73)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:62)
        at io.quarkus.arc.impl.ActivateRequestContextInterceptor.invoke(ActivateRequestContextInterceptor.java:124)
        at io.quarkus.arc.impl.ActivateRequestContextInterceptor.aroundInvoke(ActivateRequestContextInterceptor.java:33)
        at io.quarkus.arc.impl.ActivateRequestContextInterceptor_Bean.intercept(Unknown Source)
        at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
        at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:27)
        at io.quarkus.search.app.indexing.IndexingService_Subclass.reindex(Unknown Source)
        at io.quarkus.search.app.indexing.IndexingService.lambda$indexOnStartup$5(IndexingService.java:99)
        at io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)
        at io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:28)
        at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
        at io.smallrye.mutiny.operators.uni.UniRunSubscribeOn.lambda$subscribe$0(UniRunSubscribeOn.java:27)
        at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:604)
        at org.jboss.threads.EnhancedQueueExecutor$Task.doRunWith(EnhancedQueueExecutor.java:2516)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2495)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1521)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:11)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:11)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:840)
    Caused by: org.hibernate.search.util.common.SearchException: HSEARCH700101: 17 failure(s) occurred during mass indexing. See the logs for details. First failure on entity 'Guide#https://pt.quarkus.io/version/2.16/guides/security-openid-connect-dev-services': HSEARCH400588: Call to the bulk REST API failed: HSEARCH400007: Elasticsearch request failed: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
    Request: POST /_bulk with parameters {}
    Response: (no response)
        at org.hibernate.search.mapper.pojo.massindexing.impl.PojoMassIndexingNotifier.reportIndexingCompleted(PojoMassIndexingNotifier.java:183)
        at org.hibernate.search.mapper.pojo.massindexing.impl.PojoMassIndexingBatchCoordinator.notifySuccess(PojoMassIndexingBatchCoordinator.java:266)
        at org.hibernate.search.mapper.pojo.massindexing.impl.PojoMassIndexingFailureHandledRunnable.run(PojoMassIndexingFailureHandledRunnable.java:122)
        at org.hibernate.search.mapper.pojo.massindexing.impl.PojoDefaultMassIndexer.startAndWait(PojoDefaultMassIndexer.java:153)
        at org.hibernate.search.mapper.pojo.standalone.massindexing.impl.StandalonePojoMassIndexer.startAndWait(StandalonePojoMassIndexer.java:77)
        at io.quarkus.search.app.indexing.IndexingService.indexAll(IndexingService.java:230)
        ... 25 more
    Caused by: org.hibernate.search.util.common.SearchException: HSEARCH400588: Call to the bulk REST API failed: HSEARCH400007: Elasticsearch request failed: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
    Request: POST /_bulk with parameters {}
    Response: (no response)
        at org.hibernate.search.backend.elasticsearch.orchestration.impl.ElasticsearchDefaultWorkSequenceBuilder$BulkedWorkExecutionState.onBulkWorkComplete(ElasticsearchDefaultWorkSequenceBuilder.java:239)
        at org.hibernate.search.util.common.impl.Futures.lambda$handler$3(Futures.java:99)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at org.hibernate.search.backend.elasticsearch.client.impl.ElasticsearchClientImpl$1.onFailure(ElasticsearchClientImpl.java:132)
        at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:688)
        at org.elasticsearch.client.RestClient$1.failed(RestClient.java:426)
        at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:137)
        at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.executionFailed(DefaultClientExchangeHandlerImpl.java:101)
        at org.apache.http.impl.nio.client.AbstractClientExchangeHandler.failed(AbstractClientExchangeHandler.java:432)
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
        ... 1 more
    Caused by: org.hibernate.search.util.common.SearchException: HSEARCH400007: Elasticsearch request failed: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
    Request: POST /_bulk with parameters {}
    Response: (no response)
        at org.hibernate.search.backend.elasticsearch.work.impl.AbstractNonBulkableWork.lambda$execute$2(AbstractNonBulkableWork.java:64)
        at org.hibernate.search.util.common.impl.Futures.lambda$handler$1(Futures.java:62)
        at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990)
        at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:974)
        ... 19 more
    Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
        ... 11 more
    ```

    </details>

Originally posted by @quarkus-status-bot in https://github.com/quarkusio/search.quarkus.io/issues/131#issuecomment-2190241885

yrodiere commented 5 days ago

I can't find anything interesting in the OpenSearch logs, and metrics aren't exactly very precise... We can just see a spike during indexing:

[screenshot: OpenSearch metrics showing a spike during indexing]

My bet would be that Hibernate Search's mass indexing pushes documents to the indexing queues at a higher rate than our previous ad-hoc indexing did, resulting in too much work being sent in parallel to OpenSearch. OpenSearch then cannot keep up under the current resource constraints, and work ends up waiting in the OpenSearch "indexing queue" (or whatever it's called) for too long (>30s), resulting in timeouts.

We could of course increase the timeout, but that would just hide the problem and waste memory: we obviously have too many documents in the OpenSearch indexing queue.

We could tune OpenSearch settings to improve throughput, but I don't know exactly what would be needed, so I think that would require too much investigation to be worth it. Raising the OpenShift CPU limit (not the request) for OpenSearch might help, though, as it would allow higher spikes (and indexing definitely is a spike).

Instead, I'd recommend at least lowering the parallelism (the indexing queue count), and maybe also lowering the bulk size, so that processing a single bulk takes less time and a given bulk spends less time waiting in the OpenSearch indexing queue. Note that these settings are environment-specific and should be tuned for staging and prod independently: OpenSearch has more resources in prod.
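For reference, a sketch of what that tuning could look like in this app's configuration. This assumes the Quarkus Hibernate Search Standalone extension's `indexing` property names; the values below are illustrative starting points for staging, not tested recommendations:

```properties
# Fewer parallel indexing queues => less work sent to OpenSearch at once.
quarkus.hibernate-search-standalone.elasticsearch.indexing.queue-count=4
# Smaller bulks => each bulk is processed faster and spends less time
# waiting in the OpenSearch indexing queue.
quarkus.hibernate-search-standalone.elasticsearch.indexing.max-bulk-size=20
```

Since the settings are environment-specific, prod could keep values closer to the defaults.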

That being said... I'm starting to wonder whether we'd want some form of backpressure handling in the Hibernate Search Elasticsearch indexing queues... ideally when we get an error like this, Hibernate Search would realize Elasticsearch is saturated, and would temporarily throttle its output. Maybe we need some sort of Flow in there to do backpressure properly, but :x ... Anyway, do you think it would make sense to file a feature request for Hibernate Search @marko-bekhta ?
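To illustrate the kind of backpressure I mean (purely a hypothetical sketch, not an existing Hibernate Search API; the class and method names are made up): bound the number of in-flight bulks with a semaphore, and grow a backoff delay whenever a bulk times out, so the client stops piling work onto a saturated OpenSearch node.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, NOT Hibernate Search code: limit concurrent bulk
// requests and back off exponentially after timeouts.
class BulkBackpressure {
    private final Semaphore inFlight;
    private final AtomicLong backoffMillis = new AtomicLong(0);

    BulkBackpressure(int maxInFlightBulks) {
        this.inFlight = new Semaphore(maxInFlightBulks);
    }

    /** Blocks until a slot is free, first applying any current backoff delay. */
    void acquireSlot() throws InterruptedException {
        long delay = backoffMillis.get();
        if (delay > 0) {
            TimeUnit.MILLISECONDS.sleep(delay);
        }
        inFlight.acquire();
    }

    /** On success: release the slot and relax the backoff (halve it). */
    void onSuccess() {
        inFlight.release();
        backoffMillis.updateAndGet(d -> d / 2);
    }

    /** On timeout: release the slot, but double the delay (capped at 30 s). */
    void onTimeout() {
        inFlight.release();
        backoffMillis.updateAndGet(d -> Math.min(d == 0 ? 500 : d * 2, 30_000));
    }

    long currentBackoffMillis() {
        return backoffMillis.get();
    }
}
```

The point is just that a failed bulk feeds back into how fast the next ones are submitted, instead of every queue retrying at full speed against an already-overloaded node.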

marko-bekhta commented 5 days ago

+1 and +1 and +1 😃

I had similar thoughts about the cause, and the solution you're suggesting makes sense to me too. As for the feature in Search: it would be good to have something like that built in. I've seen these `Elasticsearch request failed: 30,000 milliseconds timeout on connection` errors in the past, and it would've been great to have something that deals with them, rather than adding custom code and making schema changes 😄 to make indexing go faster (which isn't a bad thing in itself, but still 😅).