Closed: yrodiere closed 21 hours ago
I can't find anything interesting in the OpenSearch logs, and metrics aren't exactly very precise... We can just see a spike during indexing:
My bet would be that the Hibernate Search indexing is pushing documents to the indexing queues at a higher rate than our previous ad-hoc indexing did, resulting in too much work being sent in parallel to OpenSearch. OpenSearch then cannot keep up under its current resource constraints, and works end up waiting in the (OpenSearch) "indexing queue" (or whatever it's called) for too long (>30s), resulting in timeouts.
We could of course increase the timeout, but that would just hide the problem and waste memory: we obviously have too many documents in the OpenSearch indexing queue.
We could tune OpenSearch settings to improve throughput, but I don't know exactly what would be needed, so I think that would require too much investigation to be worth it. Though maybe raising the OpenShift CPU limit (not request) for OpenSearch would help, as it would allow higher spikes (and indexing definitely represents a spike).
Instead, I'd recommend at least lowering the parallelism (indexing queue count), and maybe also lowering the bulk size, so that processing a single bulk takes less time and a given bulk spends less time waiting in the OpenSearch indexing queue. Note these settings are environment-specific and should be adjusted for staging and prod independently: OpenSearch has more resources on prod.
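As a rough illustration of the two knobs mentioned above, this is what the change could look like with the Hibernate Search Elasticsearch backend's indexing properties. The values are placeholders, not recommendations: the right numbers depend on the resources OpenSearch has in each environment. In a Quarkus application the same settings are exposed under the extension's own configuration prefix, and a profile prefix such as `%prod.` can scope a value to one environment.

```properties
# Illustrative values only; tune separately for staging and prod.
# Fewer indexing queues per index means fewer bulk requests sent in parallel.
hibernate.search.backend.indexing.queue_count=4
# Smaller bulks are processed faster, so each one waits less in OpenSearch.
hibernate.search.backend.indexing.max_bulk_size=20
```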
That being said... I'm starting to wonder whether we'd want some form of backpressure handling in the Hibernate Search Elasticsearch indexing queues... ideally when we get an error like this, Hibernate Search would realize Elasticsearch is saturated, and would temporarily throttle its output. Maybe we need some sort of `Flow` in there to do backpressure properly, but :x ... Anyway, do you think it would make sense to file a feature request for Hibernate Search @marko-bekhta ?
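To make the "temporarily throttle its output" idea concrete, here is a minimal sketch of one possible approach: an additive-increase/multiplicative-decrease limiter on the number of bulk requests in flight. This is not Hibernate Search code and not how such a feature would necessarily be designed; `AdaptiveBulkLimiter` and all of its methods are hypothetical names made up for the example, and it does not use `Flow`.

```java
/**
 * Hypothetical sketch, not part of Hibernate Search: caps the number of
 * in-flight bulk requests, halving the cap on a timeout and raising it
 * by one on each success, up to a configured maximum.
 */
final class AdaptiveBulkLimiter {
    private final int minLimit;
    private final int maxLimit;
    private int limit;     // current cap on in-flight bulk requests
    private int inFlight;  // bulk requests currently awaiting a response

    AdaptiveBulkLimiter(int minLimit, int maxLimit) {
        if (minLimit < 1 || maxLimit < minLimit) {
            throw new IllegalArgumentException("need 1 <= minLimit <= maxLimit");
        }
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
        this.limit = maxLimit; // start optimistic, back off on errors
    }

    /** Returns true if the caller may send a bulk request now. */
    synchronized boolean tryAcquire() {
        if (inFlight >= limit) {
            return false;
        }
        inFlight++;
        return true;
    }

    /** Call when a bulk request completed normally: grow the cap slowly. */
    synchronized void onSuccess() {
        inFlight--;
        if (limit < maxLimit) {
            limit++;
        }
    }

    /** Call when a bulk request timed out: cut the cap in half. */
    synchronized void onTimeout() {
        inFlight--;
        limit = Math.max(minLimit, limit / 2);
    }

    synchronized int currentLimit() {
        return limit;
    }
}
```

A sender would call `tryAcquire()` before each bulk and report the outcome with `onSuccess()` or `onTimeout()`; when `tryAcquire()` returns false, the bulk simply stays queued on the client side instead of piling up in OpenSearch.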
+1 and +1 and +1 😃
I had similar thoughts on the reason for this, and the solution you are suggesting also makes sense to me. As for the feature in Search: it would be good to have something like that in. I've seen these `Elasticsearch request failed: 30,000 milliseconds timeout on connection` errors in the past, and it would've been great to have something that deals with them rather than adding custom code and making schema changes 😄 to make indexing go faster (which in itself isn't a bad thing, but still 😅).
We got this last night:
Originally posted by @quarkus-status-bot in https://github.com/quarkusio/search.quarkus.io/issues/131#issuecomment-2190241885