Closed jordansissel closed 7 years ago
*Copied my comment from the original issue*
Buried in the exception is this:
[Manticore::SocketTimeout] Read timed out
The request to your Elasticsearch server has timed out.
I feel that it is fair to consider elasticsearch "unreachable or down" in this scenario. The output will retry.
Based on the data so far, I think the behavior of the plugin seems correct.
I believe I have an explanation for this.
Conclusion so far: You are overloading your Elasticsearch cluster and causing requests from Logstash to timeout. Logstash is retrying, but otherwise correctly identifying a problem with Elasticsearch.
Recommendation: Avoid overloading Elasticsearch by having Logstash push more softly (reduce pipeline workers, reduce batch size, etc) or change your timeout setting to be very very high.
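To make the recommendation concrete, here is a hedged sketch of the two knobs mentioned (the values are illustrative, not tuned recommendations; `timeout` is the request timeout option of the elasticsearch output plugin, in seconds):

```
# logstash.yml -- push more softly (example values, tune for your cluster):
#   pipeline.workers: 2
#   pipeline.batch.size: 125

# logstash.conf -- or raise the request timeout on the ES output:
output {
  elasticsearch {
    hosts   => ["127.0.0.1:9200"]
    timeout => 600   # seconds; plugin default is 60
  }
}
```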
> Buried in the exception is this: [Manticore::SocketTimeout] Read timed out The request to your Elasticsearch server has timed out.
> I feel that it is fair to consider elasticsearch "unreachable or down" in this scenario. The output will retry.
While it may make sense to throw an unreachable-or-down error based on what comes back from Manticore when the request times out, it's not really intuitive to the end user. Folks like us who understand Elasticsearch thread pools, queues and such will know to look at the queue size and reduce the workers, etc., to work around it. Other users may look at the error message and think that ES is truly unreachable/down. In this case the requests are just queued up and not responding in the time allocated, so it would be helpful to clarify the error message :)
By the way, I tested the same setup on 2.3.4 using the same LS input (the only difference being the syntax for Java and non-Java events), the same LS config (other than the workers setting, which is present in 2.3.4), and running against the same ES 5.0 instance in the same environment. On LS 2.3.4 the workers setting for the ES output defaults to 1, so to simulate a similar load I set workers => 4 on its output to match the pipeline workers. With this setup on 2.3.4, ES active threads go to 4 just like in the LS 5.0 case, and queue length also goes up to 12-16. But on 2.3.4, the pipeline runs through with no warnings and exceptions at all ...
Hmmm, the weird thing here is that in the case of a full thread pool ES should return a 429, yes? If it's not responding at all, that's different. Are there errors in the ES log @ppf2 ?
The key difference here is that a 429 error will be retried but won't mark the endpoint as dead. The output will wait and retry it with another random endpoint. A broken connection marks the endpoint as dead, removing it from the pool. It will be retried in a similar way, BUT the connection must also recover.
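The distinction above can be sketched in a few lines of Ruby. This is a hypothetical model, not the plugin's actual code: `EndpointPool`, `report`, and `revive` are invented names used only to illustrate how a 429 leaves the endpoint in the pool while a broken connection removes it until it recovers.

```ruby
# Hypothetical sketch of an endpoint pool distinguishing "overloaded but
# reachable" (HTTP 429) from "connection broken" (marked dead).
class EndpointPool
  def initialize(urls)
    @alive = urls.dup
    @dead  = []
  end

  # Pick a random live endpoint for the next request.
  def pick
    @alive.sample
  end

  def report(url, outcome)
    case outcome
    when :http_429
      # Overloaded but reachable: the request is retried later,
      # and the endpoint stays in the pool.
    when :connection_error
      # Unreachable: remove from the pool until a health check revives it.
      @dead << url if @alive.delete(url)
    end
  end

  # Called when a health check sees the endpoint recover.
  def revive(url)
    @alive << url if @dead.delete(url)
  end

  attr_reader :alive
end

pool = EndpointPool.new(["http://es1:9200", "http://es2:9200"])
pool.report("http://es1:9200", :http_429)
# 429 does not shrink the pool; both endpoints remain eligible.
pool.report("http://es1:9200", :connection_error)
# A broken connection marks the endpoint dead; only es2 remains until revive.
```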
@ppf2 have you seen this since then? I should mention #523 may be the 'fix' for this situation.
Not recently, we can probably close this and can create a new ticket if the problem reoccurs even with 523. Thx!
I've started seeing this problem as soon as I added `max_open_files => "5000"` to the `input -> file` section of logstash.conf. Without that option, I get a different message:
Reached open files limit: 4095, set by the 'max_open_files' option or default, files yet to open: 12525
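For reference, here is a hedged sketch of where that option sits in the config. The path is purely illustrative; only `max_open_files` is taken from the report above:

```
input {
  file {
    path           => "/var/log/app/*.log"   # illustrative path
    max_open_files => 5000
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
  }
}
```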
OS X: 10.12.4
javac: 1.8.0_111
logstash: 5.3.2
elasticsearch: Version: 5.4.0, Build: 780f8c4/2017-04-28T17:43:27.229Z, JVM: 1.8.0_111
* Moved from https://github.com/elastic/logstash/issues/6133, originally filed by @ppf2 *
LS + ES 5.0
ES 5.0 using all default settings (no customizations). LS config is simply an input plugin -> ES output (no filters), writing to a local ES 5.0 node.
Running this on a laptop (mac) with 4 cores. After running for a few minutes, the bulk queue reaches 12+, but still no rejections.
The queue usage slowly increments to 19+, etc., but Logstash started throwing connection exceptions (though ES is alive and well) around the 10-12 queued mark. ES memory usage is only at 30-40% (no long young- or old-generation GCs reported in the ES log), with a load average of around 4-5.
LS stats:
[2016-10-26T17:24:24,120][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}