snowplow / snowplow-elasticsearch-loader

Writes Snowplow enriched events from Kinesis to Elasticsearch
http://snowplowanalytics.com/

Timeout on converting Future to IO #242

Closed: istreeter closed this issue 2 years ago

istreeter commented 2 years ago

As described in #241, I have seen logs and thread dumps where it seems the underlying Elasticsearch client never calls the onFailure callback on the elastic4s client. As a result, the Future the loader is waiting on never resolves, and the app hangs.

So far I have only seen this happen when calling the /_cluster/health endpoint (and we're removing those calls anyway, see #241). But if it can happen on one endpoint, then I'm worried it could also happen on the bulk index endpoint.

My proposed solution is to add a timeout to the wait on the Future when indexing events. If we do not receive the HTTP response (success or failure) before the timeout, the wait should fail with a TimeoutException, and the request can then be retried.

The timeout is only there to guard against the rare case where the callback is mysteriously never called. Therefore, I think we can set the timeout to something large, say 60 seconds. The underlying Elasticsearch client already has timeouts configured for the TCP connection to the cluster.
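
To illustrate the idea, here is a minimal sketch using only the Scala standard library. It is not the loader's actual code: `withTimeout`, `retrying`, and `bulkIndex` are hypothetical names, and the hanging request is simulated with a Promise that is never completed. The demo uses a 200 ms timeout so it finishes quickly; in the loader the value would be something like the 60 seconds suggested above.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object FutureTimeoutSketch {

  // Single daemon thread whose only job is to fire timeouts.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "timeout-scheduler")
      t.setDaemon(true)
      t
    }
  })

  /** Mirrors `underlying`, but fails with a TimeoutException if it has not
    * resolved (success or failure) within `limit`. Hypothetical helper. */
  def withTimeout[A](underlying: Future[A], limit: FiniteDuration)(
      implicit ec: ExecutionContext): Future[A] = {
    val promise = Promise[A]()
    val task = scheduler.schedule(
      new Runnable {
        def run(): Unit = {
          promise.tryFailure(new TimeoutException(s"No response after $limit"))
          ()
        }
      },
      limit.toMillis,
      TimeUnit.MILLISECONDS
    )
    underlying.onComplete { result =>
      task.cancel(false) // the response arrived, so the timer is no longer needed
      promise.tryComplete(result)
    }
    promise.future
  }

  /** Re-runs `action` when it fails with a TimeoutException, making at most
    * `attempts` calls in total. Other failures are not retried here. */
  def retrying[A](attempts: Int)(action: () => Future[A])(
      implicit ec: ExecutionContext): Future[A] =
    action().recoverWith {
      case _: TimeoutException if attempts > 1 => retrying(attempts - 1)(action)
    }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val calls = new AtomicInteger(0)

    // Hypothetical stand-in for the bulk index request: the first call never
    // resolves (the callback is "lost"), the second call succeeds.
    def bulkIndex(): Future[String] =
      if (calls.incrementAndGet() == 1) Promise[String]().future
      else Future.successful("indexed")

    val result = retrying(3)(() => withTimeout(bulkIndex(), 200.millis))
    println(Await.result(result, 5.seconds)) // prints "indexed"
  }
}
```

Note that the timeout only stops the loader from waiting; it does not cancel the underlying request, so a retry could in principle run while the original request is still in flight.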