logstash-plugins / logstash-output-elasticsearch

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
Apache License 2.0

Unexpected warnings/errors about Elasticsearch being down #498

Closed jordansissel closed 7 years ago

jordansissel commented 7 years ago

* Moved from https://github.com/elastic/logstash/issues/6133, originally filed by @ppf2 *

LS + ES 5.0

ES 5.0 is running with all default settings (no customizations). The LS config is simply an input plugin -> ES output (no filters), writing to a local ES 5.0 node:

input {
  sfcases {
    ...
  }
}

output {
  elasticsearch {
    user            => "logstash"
    password        => "password"
    index           => "sfcases"
    document_type   => "%{type}"
    document_id     => "%{CaseId}"
    manage_template => false
    template_name   => "sfcases"
  }
}

Running this on a laptop (Mac) with 4 cores. After running for a few minutes, the bulk queue reaches 12+, but still no rejections:

node_name name                active queue rejected
Usmtk-W   bulk                     4    12        0

The queue usage slowly increments to 19+, but Logstash starts throwing connection exceptions (though ES is alive and well) around the 10-12 queued mark. ES memory usage is only at 30-40% (no long young or old GCs reported in the ES log), with a load average of around 4-5.
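For reference, bulk thread pool usage like the table above can be watched with the cat thread pool API; the exact command used for the report isn't given, so the host and options below are assumptions:

curl 'http://127.0.0.1:9200/_cat/thread_pool/bulk?v'   # add -u user:pass if security is enabled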

[2016-10-26T17:59:32,314][WARN ][logstash.outputs.elasticsearch] Marking url as dead. {:reason=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :url=>#<URI::HTTP:0x1c6d5d09 URL:http://~hidden~:~hidden~@127.0.0.1:9200>, :error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2016-10-26T17:59:32,316][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2016-10-26T17:59:34,350][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2016-10-26T17:59:34,351][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2016-10-26T17:59:36,067][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:url=>#<URI::HTTP:0x13ad09be URL:http://~hidden~:~hidden~@127.0.0.1:9200>, :healthcheck_path=>"/"}
[2016-10-26T17:59:36,073][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<URI::HTTP:0x13ad09be URL:http://~hidden~:~hidden~@127.0.0.1:9200>}
[2016-10-26T17:59:40,441][WARN ][logstash.outputs.elasticsearch] Marking url as dead. {:reason=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :url=>#<URI::HTTP:0x2e4b65d0 URL:http://~hidden~:~hidden~@127.0.0.1:9200>, :error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2016-10-26T17:59:40,443][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2016-10-26T17:59:41,094][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:url=>#<URI::HTTP:0x22af0f3f URL:http://~hidden~:~hidden~@127.0.0.1:9200>, :healthcheck_path=>"/"}
[2016-10-26T17:59:41,101][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<URI::HTTP:0x22af0f3f URL:http://~hidden~:~hidden~@127.0.0.1:9200>}

LS stats:

{
  "host" : "Iceman",
  "version" : "5.0.0",
  "http_address" : "127.0.0.1:9600",
  "events" : {
    "in" : 2381,
    "filtered" : 2381,
    "out" : 1937,
    "duration_in_millis" : 940074
  },
  "jvm" : {
    "timestamp" : 1477530335474,
    "uptime_in_millis" : 419893,
    "memory" : {
      "heap_used_in_bytes" : 369429696,
      "heap_used_percent" : 17,
      "heap_committed_in_bytes" : 534585344,
      "heap_max_in_bytes" : 2077753344,
      "non_heap_used_in_bytes" : 203825664,
      "non_heap_committed_in_bytes" : 216940544,
      "pools" : {
        "survivor" : {
          "peak_used_in_bytes" : 8912896,
          "used_in_bytes" : 16776968,
          "peak_max_in_bytes" : 34865152,
          "max_in_bytes" : 69730304,
          "committed_in_bytes" : 17825792
        },
        "old" : {
          "peak_used_in_bytes" : 141354168,
          "used_in_bytes" : 278385592,
          "peak_max_in_bytes" : 724828160,
          "max_in_bytes" : 1449656320,
          "committed_in_bytes" : 373497856
        },
        "young" : {
          "peak_used_in_bytes" : 71630848,
          "used_in_bytes" : 74267136,
          "peak_max_in_bytes" : 279183360,
          "max_in_bytes" : 558366720,
          "committed_in_bytes" : 143261696
        }
      }
    }
  }
}
{
  "host" : "Iceman",
  "version" : "5.0.0",
  "http_address" : "127.0.0.1:9600",
  "pipeline" : {
    "events" : {
      "duration_in_millis" : 1503947,
      "in" : 2503,
      "filtered" : 2503,
      "out" : 2148
    },
    "plugins" : {
      "inputs" : [ ],
      "filters" : [ ],
      "outputs" : [ {
        "id" : "a8efd9b083816cc441eab95358c71f1113f5b8e4-2",
        "events" : {
          "in" : 2503,
          "out" : 2148
        },
        "name" : "elasticsearch"
      } ]
    },
    "reloads" : {
      "last_error" : null,
      "successes" : 0,
      "last_success_timestamp" : null,
      "last_failure_timestamp" : null,
      "failures" : 0
    }
  }
}
{
  "host" : "Iceman",
  "version" : "5.0.0",
  "http_address" : "127.0.0.1:9600",
  "hot_threads" : {
    "time" : "2016-10-26T18:16:47-07:00",
    "busiest_threads" : 3,
    "threads" : [ {
      "name" : "[main]<sfcases",
      "percent_of_cpu_time" : 1.94,
      "state" : "waiting",
      "traces" : [ "sun.misc.Unsafe.park(Native Method)", "java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)", "java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)", "java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)", "java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)", "sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)", "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)", "java.lang.reflect.Method.invoke(Method.java:497)", "org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:451)", "org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:312)" ]
    }, {
      "name" : "LogStash::Runner",
      "percent_of_cpu_time" : 1.92,
      "state" : "timed_waiting",
      "traces" : [ "java.lang.Object.wait(Native Method)", "java.lang.Thread.join(Thread.java:1253)", "org.jruby.internal.runtime.NativeThread.join(NativeThread.java:75)", "org.jruby.RubyThread.join(RubyThread.java:697)", "org.jruby.RubyThread$INVOKER$i$0$1$join.call(RubyThread$INVOKER$i$0$1$join.gen)", "org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:663)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:198)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:306)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:136)", "org.jruby.ast.CallNoArgNode.interpret(CallNoArgNode.java:60)" ]
    }, {
      "name" : "[main]>worker1",
      "percent_of_cpu_time" : 0.18,
      "state" : "runnable",
      "traces" : [ "java.net.SocketInputStream.socketRead0(Native Method)", "java.net.SocketInputStream.socketRead(SocketInputStream.java:116)", "java.net.SocketInputStream.read(SocketInputStream.java:170)", "java.net.SocketInputStream.read(SocketInputStream.java:141)", "org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)", "org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)", "org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)", "org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)", "org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)", "org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)" ]
    } ]
  }
}

If I lower -w to half the number of cores, the queue usage falls to 0-3 and I don't see these connection warnings and errors thrown by Manticore, which makes me wonder if the default worker and batch settings may be a bit aggressive.
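For reference, a minimal sketch of pushing more softly on Logstash 5.x, either at startup or in logstash.yml; the config file name and values are illustrative, not taken from the report above:

bin/logstash -w 2 -b 125 -f sfcases.conf    # halve the workers, keep the default batch size

or, equivalently, in logstash.yml:

pipeline.workers: 2        # defaults to the number of CPU cores
pipeline.batch.size: 125   # the 5.x default; reduce it to send smaller bulk requests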

Also, errors like the following can be misleading to the end user since ES is reachable and not down during this time :)

[2016-10-26T17:24:24,120][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://~hidden~:~hidden~@127.0.0.1:9200][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}

jordansissel commented 7 years ago

** Copied my comment from the original issue **

Buried in the exception is this:

[Manticore::SocketTimeout] Read timed out

The request to your Elasticsearch server has timed out.

I feel that it is fair to consider elasticsearch "unreachable or down" in this scenario. The output will retry.

Based on the data so far, I think the behavior of the plugin seems correct.

jordansissel commented 7 years ago

I believe I have an explanation for this.

Conclusion so far: You are overloading your Elasticsearch cluster and causing requests from Logstash to timeout. Logstash is retrying, but otherwise correctly identifying a problem with Elasticsearch.

Recommendation: Avoid overloading Elasticsearch by having Logstash push more softly (reduce pipeline workers, reduce batch size, etc) or change your timeout setting to be very very high.
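As a sketch of the second option, the request timeout can be raised on the output itself; the value here is illustrative and assumes the 5.x plugin's timeout option (per-request timeout in seconds):

output {
  elasticsearch {
    hosts   => ["127.0.0.1:9200"]
    timeout => 300   # seconds; give slow bulk responses far longer before the host is marked dead
  }
}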

ppf2 commented 7 years ago

Buried in the exception is this:

[Manticore::SocketTimeout] Read timed out

The request to your Elasticsearch server has timed out.

I feel that it is fair to consider elasticsearch "unreachable or down" in this scenario. The output will retry.

While it may make sense to throw an unreachable-or-down error based on what is coming back from Manticore when the request times out, it's not really intuitive to the end user. Folks like us who understand Elasticsearch thread pools, queues and such will know to look at the queue size and reduce the workers, etc. to work around it. Other users may look at the error message and think that ES is truly unreachable/down. In this case the requests are just queued up and not answered in the time allocated, so it would be helpful to clarify the error message :)

By the way, I tested the same setup on 2.3.4 using the same LS input (the only difference being the syntax for Java and non-Java events), the same LS config (other than the workers setting, which is present in 2.3.4), and running against the same ES 5.0 instance in the same environment. On LS 2.3.4 the default workers setting for the ES output is 1, so to simulate a similar load I set workers => 4 on the output to match the pipeline workers. With this setup on 2.3.4, ES active threads go to 4 just like in the LS 5.0 case, and the queue length also goes up to 12-16. But on 2.3.4, the pipeline runs through with no warnings or exceptions at all.
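For context, a sketch of the 2.3.4 output block just described; the hosts value is an assumption, and the workers line is the only point of interest:

output {
  elasticsearch {
    hosts   => ["127.0.0.1:9200"]
    index   => "sfcases"
    workers => 4   # per-output workers, matched to the 4 pipeline workers used in the 5.0 test
  }
}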

andrewvc commented 7 years ago

Hmmm, the weird thing here is that in the case of a full thread pool, ES should return a 429, yes? If it's not responding at all, that's different. Are there errors in the ES log, @ppf2?

andrewvc commented 7 years ago

The key difference here is that a 429 error will be retried but won't mark the endpoint as dead; the output will wait and retry it with another random endpoint. A broken connection marks the endpoint as dead, removing it from the pool. It will be retried in a similar way, BUT the connection must also recover.

andrewvc commented 7 years ago

@ppf2 have you seen this since then? I should mention #523 may be the 'fix' for this situation.

ppf2 commented 7 years ago

Not recently. We can probably close this and create a new ticket if the problem recurs even with #523. Thx!

nnnikolay commented 7 years ago

I started seeing this problem as soon as I added

max_open_files => "5000"

to the input -> file section of logstash.conf.
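For context, a minimal sketch of such a file input block (the path is illustrative, not from the original config):

input {
  file {
    path => "/var/log/myapp/*.log"
    max_open_files => 5000   # raise the file input's default cap on simultaneously open files
  }
}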

Without that option, I get another message:

Reached open files limit: 4095, set by the 'max_open_files' option or default, files yet to open: 12525

OS X: 10.124
javac: 1.8.0_111
logstash: 5.3.2
elasticsearch: Version: 5.4.0, Build: 780f8c4/2017-04-28T17:43:27.229Z, JVM: 1.8.0_111