logstash-plugins / logstash-output-elasticsearch_java

Java API Implementation of Elasticsearch Output
Apache License 2.0

Can't connect to ES with protocol => "node" / LS 2.0.0-1 #28

Closed ceeeekay closed 8 years ago

ceeeekay commented 8 years ago

Hi there,

I have a previously working config from Logstash 2.0.0-rc1-1 which was successfully indexing to Elasticsearch using the node protocol with logstash-output-elasticsearch_java.

After upgrading to Logstash 2.0.0-1 it can no longer connect to ES using the node protocol.

Logstash reports:

Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master]; {:level=>:error, :file=>"logstash/outputs/elasticsearch_java.rb", :line=>"478", :method=>"flush"}

There are no errors in the logs on the ES master node, or issues with network connectivity. The ES cluster is green, and responding at all times.

If I change the protocol to "transport", Logstash connects and indexes; however, this upsets our monitoring, as the Logstash nodes no longer appear in the cluster.

Downgrading Logstash to 2.0.0-rc1-1 immediately resolves the problem, with no other changes required.

Can someone please confirm that node protocol works correctly with Logstash 2.0.0-1?

ceeeekay commented 8 years ago

I can confirm this is still occurring with Logstash 2.1.0-1 and logstash-output-elasticsearch_java 2.0.2

The logstash node connects to the cluster and is visible in the node list, but reports [SERVICE_UNAVAILABLE/2/no master].

jordansissel commented 8 years ago

@ceeeekay with what version of elasticsearch? Can you attach your Logstash and Elasticsearch server configs?

ceeeekay commented 8 years ago

@jordansissel currently ES 2.0.0; looking to upgrade today.

LS:

    output {
      elasticsearch_java {
        hosts => [ "dev-master1", "dev-master2", "dev-master3" ]
        node_name => "Logstash 1"
        network_host => "10.2.15.51"
        cluster => "dev"
        index => "%{[@metadata][index_prefix]}-%{client_id}-%{+YYYY.MM.dd}"
        protocol => "node"
        template_overwrite => "false"
        manage_template => "false"
      }
    }

ES - data node:

    cluster.name: dev
    node.name: dev-search1
    node.master: false
    node.data: true
    index.number_of_replicas: 0
    index.refresh_interval: 30s
    path.repo: ["/var/lib/es-snapshot/kibana-index"]
    network.bind_host: 0.0.0.0
    network.publish_host: 10.2.15.61
    transport.profiles.client:
      port: 9500-9600
      type: client
    http.port: 9201
    gateway.recover_after_nodes: 2
    gateway.recover_after_time: 5m
    gateway.expected_nodes: 3
    discovery.zen.minimum_master_nodes: 2
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["dev-master1", "dev-master2", "dev-master3"]
    bootstrap.mlockall: true
    marvel.agent.enabled: true
    marvel.agent.exporter.es.hosts: "localhost:9201"
    index.translog.flush_threshold_ops: 50000
    indices.recovery.max_bytes_per_sec: "100mb"
    action.disable_delete_all_indices: true

    threadpool.search.type: fixed
    threadpool.search.size: 20
    threadpool.search.queue_size: 100

    threadpool.bulk.type: fixed
    threadpool.bulk.size: 60
    threadpool.bulk.queue_size: 300

    threadpool.index.type: fixed
    threadpool.index.size: 20
    threadpool.index.queue_size: 100

jordansissel commented 8 years ago

    network_host => "10.2.15.51"

Is this setting correct? Can all Elasticsearch nodes talk to Logstash on this IP? Are the ports for Elasticsearch's node client open on the Logstash server (firewall, etc.)?

    network_host => "10.2.15.51"

Can the Logstash server reach Elasticsearch on this IP? Are the ports open from Logstash to Elasticsearch? For all nodes?

I'm guessing this is a networking or network configuration issue for now.
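A quick way to test both directions (a sketch, assuming plain netcat is available and the default transport port 9300, since the Logstash config above doesn't override it):

    # from an Elasticsearch node: can it reach the Logstash node client's transport port?
    nc -zv 10.2.15.51 9300

    # from the Logstash node: can it reach each Elasticsearch node's transport port?
    nc -zv dev-master1 9300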

ceeeekay commented 8 years ago

This is the correct IP for the LS node. All the dev hosts have a single interface on the same /24, with no firewalling or routing. This setting is only present so it matches our production configs, where the nodes have a separate VLAN for ES comms.

Note that simply downgrading LS to 2.0.0-rc1-1 and reinstalling the plugin resolves the problem, with no config changes.

jordansissel commented 8 years ago

Ok that's useful info. Thanks! Hopefully we can track this down and figure out what's causing it.

ceeeekay commented 8 years ago

Thanks whack :)

Just upgraded everything to ES 2.1 - issue persists.

andrewvc commented 8 years ago

@ceeeekay @jordansissel confirmed with the LS 2.0 release. I just rolled a beta gem to fix this.

This does currently work on master, but master is blocked on some bugs I found today: there is one bug, to be fixed by #33 and #34 (which I just submitted), that is a blocker for master going out as a release.

@ceeeekay can you try editing the Gemfile that ships with Logstash? Find the line for logstash-output-elasticsearch_java and replace it with gem "logstash-output-elasticsearch_java", "2.1.1.beta1", then run bin/plugin install.

That should fix it up. Please let us know if this works.

ceeeekay commented 8 years ago

@andrewvc I can't get v2.1.1.beta1 of the plugin to install.

I've replaced the plugin entry in /opt/logstash/Gemfile with gem "logstash-output-elasticsearch_java", "2.1.1.beta1", however running bin/plugin install logstash-output-elasticsearch_java causes this entry to revert to just gem "logstash-output-elasticsearch_java", and 2.0.2 is installed.

Am I doing something wrong here?

andrewvc commented 8 years ago

My apologies @ceeeekay you'll need to run bin/plugin install --no-verify for that to work!
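Putting the two steps together (a sketch, assuming the default package layout under /opt/logstash):

    # 1) in /opt/logstash/Gemfile, change the plugin line to:
    #      gem "logstash-output-elasticsearch_java", "2.1.1.beta1"
    # 2) then, from /opt/logstash, run:
    bin/plugin install --no-verify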

ceeeekay commented 8 years ago

Hi @andrewvc - I can't get this plugin installed no matter what I try.

The complete command I'm using is bin/plugin install --no-verify logstash-output-elasticsearch_java

I've tried installing it to a fresh Logstash installation, as well as over the top of the 2.0.2 plugin. The entry in the Gemfile is consistently reverted to the default, and I end up with 2.0.2 installed.

I've also tried bin/plugin update logstash-output-elasticsearch_java (with the updated Gemfile) with no success.

Sorry if this is basic stuff - I've been using logstash for quite a while now but I don't really understand the mechanics of the plugin installer.

ceeeekay commented 8 years ago

@andrewvc Well, typically, I got it to install just after I replied; I didn't realise bin/plugin install --no-verify needs no arguments.

I'm still seeing the same errors but with a different format.

Here are the steps so far, from a clean Logstash install:

    root@dev-index1:/opt/logstash# bin/plugin list --verbose logstash-output-elasticsearch_java
    ERROR: No plugins found

    root@dev-index1:/opt/logstash# tail -1 Gemfile
    gem "logstash-output-elasticsearch_java", "2.1.1.beta1"

    root@dev-index1:/opt/logstash# bin/plugin install --no-verify
    Installing...
    Installation successful

    root@dev-index1:/opt/logstash# bin/plugin list --verbose logstash-output-elasticsearch_java
    logstash-output-elasticsearch_java (2.1.1.beta1)

  • Start Logstash
  • Node is seen to join cluster

(on active master...)

    root@dev-master3:~# curl localhost:9201/_cat/master
    5jIGag1LToi6GHsxVh5qtw 10.2.15.73 10.2.15.73 dev-master3

    root@dev-master3:~# curl -s localhost:9201/_cat/nodes | grep Logstash
    10.2.15.51 10.2.15.51 7 91 0.13 c - Logstash 1

I note the Logstash node doesn't report a node ID. Is this significant?

(on Logstash node...)

Attempted to send a bulk request to Elasticsearch configured at '["dev-master1", "dev-master2", "dev-master3"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided? {:client_config=>{:port=>9300, :protocol=>"node", :client_settings=>{"cluster.name"=>"dev", "network.host"=>"10.2.15.51", "client.transport.sniff"=>false, "node.name"=>"Logstash 1"}, :hosts=>["dev-master1", "dev-master2", "dev-master3"]}, :error_message=>"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];...

ES cluster is green and responding to queries. No indexing is taking place at this point, and the LS error repeats periodically. No sign of any errors in the active master's Elasticsearch log.

andrewvc commented 8 years ago

@ceeeekay that is an interesting error. Are you certain that the cluster name 'dev' you're providing to logstash is 100% identical to that of your cluster?

I'm going to ask around today to see if I can find some alternate explanations.

ceeeekay commented 8 years ago

@andrewvc the cluster name in the config is definitely correct. Switching back to transport protocol works fine with the same settings. I'd assume that if any of it was incorrect the Logstash node would not appear in the node list. Also note that all of this config was previously working with logstash 2.0-rc1.

andrewvc commented 8 years ago

@ceeeekay and is this cluster using Elasticsearch 2.0 or 2.1? The latest jar is built against Elasticsearch 2.1

ceeeekay commented 8 years ago

@andrewvc the cluster was upgraded to 2.1 last week.

andrewvc commented 8 years ago

@ceeeekay can you confirm that your Elasticsearch nodes can talk to your Logstash instance at the IP provided in 'network_host', and that nothing is blocking that (e.g. a firewall)?

andrewvc commented 8 years ago

@ceeeekay I noticed that your Elasticsearch config sets the port range to 9500-9600, but your Logstash config does not specify ports at all.

Is that the actual config you're using? If so, perhaps your Logstash is connecting to a different Elasticsearch instance on the default port range (9300-9399).
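If a port mismatch is the culprit, pinning it explicitly in the output block would rule that out. A sketch, assuming the plugin's port setting (it appears as :port=>9300 in your error output above) can be set directly:

    output {
      elasticsearch_java {
        hosts => [ "dev-master1", "dev-master2", "dev-master3" ]
        port => "9300"
        # ... remaining settings as in your config above
      }
    }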

ceeeekay commented 8 years ago

@andrewvc There's no firewall here. This is a dev stack on a single isolated VLAN with a /24. The port specification is only for transport clients, as I wanted to move them away from the node traffic on 9300, as per https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html#_tcp_transport_profiles (the port directive is part of transport.profiles.client, as shown in the config above).

This has all previously been working, and is still working in prod. The only thing that's changed has been to upgrade Logstash and Elasticsearch on dev. Downgrading LS to 2.0-rc1 resolves the problem, as does switching to transport protocol.

andrewvc commented 8 years ago

@ceeeekay can you send the full output of http://localhost:9200/_cat/nodes?

BTW, the logstash node does have a node_id; it just isn't included in the default _cat/nodes output (which shows names, not ids).
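If you do want the ids, _cat/nodes will include them when asked explicitly via the h= parameter, e.g.:

    curl 'localhost:9201/_cat/nodes?h=id,ip,name'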

Currently I'm not having great luck reproducing this.

andrewvc commented 8 years ago

By the way, @ceeeekay can you reproduce this locally against a single Elasticsearch box?

ceeeekay commented 8 years ago

@andrewvc Do you want the node output while I'm attempting to use the node protocol? It's currently using transport.

andrewvc commented 8 years ago

@ceeeekay since this is so mysterious, why not just throw in both for good measure :)

ceeeekay commented 8 years ago

Node:

    $ curl localhost:9201/_cat/nodes
    10.2.15.71 10.2.15.71  2 92 0.01 - m dev-master1
    10.2.15.72 10.2.15.72  7 91 1.68 - m dev-master2
    10.2.15.73 10.2.15.73  5 91 0.12 - * dev-master3
    10.2.15.61 10.2.15.61 46 65 0.32 d - dev-search1
    10.2.15.81 10.2.15.81  6 81 0.65 - - dev-query1
    10.2.15.51 10.2.15.51  6 87 0.28 c - Logstash 1
    10.2.15.11 10.2.15.11  6 96 2.83 - - dev-mgmt1

Transport:

    $ curl localhost:9201/_cat/nodes
    10.2.15.73 10.2.15.73  6 91 0.20 - * dev-master3
    10.2.15.11 10.2.15.11  6 95 7.97 - - dev-mgmt1
    10.2.15.72 10.2.15.72  1 92 1.24 - m dev-master2
    10.2.15.81 10.2.15.81  8 81 0.21 - - dev-query1
    10.2.15.71 10.2.15.71  3 92 0.11 - m dev-master1
    10.2.15.61 10.2.15.61 44 67 0.21 d - dev-search1

I'll try reproducing it with a single ES node shortly.

andrewvc commented 8 years ago

By the way @ceeeekay can you paste the full error when using the node protocol? I noticed you truncated it in https://github.com/logstash-plugins/logstash-output-elasticsearch_java/issues/28#issuecomment-160017810

I'm wondering if there might be extra info there that's useful.

ceeeekay commented 8 years ago

@andrewvc

Attempted to send a bulk request to Elasticsearch configured at '["dev-master1", "dev-master2", "dev-master3"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided? {:client_config=>{:port=>9300, :protocol=>"node", :client_settings=>{"cluster.name"=>"dev", "network.host"=>"10.2.15.51", "client.transport.sniff"=>false, "node.name"=>"Logstash 1"}, :hosts=>["dev-master1", "dev-master2", "dev-master3"]}, :error_message=>"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :error_class=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:154)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:144)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:212)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:71)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:150)", "org.elasticsearch.action.support.ThreadedActionListener$2.doRun(org/elasticsearch/action/support/ThreadedActionListener.java:104)", "org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org/elasticsearch/common/util/concurrent/AbstractRunnable.java:37)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:error}

blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master]; {:class=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:154)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:144)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:212)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:71)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:150)", "org.elasticsearch.action.support.ThreadedActionListener$2.doRun(org/elasticsearch/action/support/ThreadedActionListener.java:104)", "org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org/elasticsearch/common/util/concurrent/AbstractRunnable.java:37)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

jordansissel commented 8 years ago

@ceeeekay @andrewvc and I are having a tough time trying to reproduce this, and we're both unable to find a reason why this would occur for you. I think we need more data:

1) Can you give us a quick matrix of which LS, ES, and logstash-output-elasticsearch_java version combinations work and do not work for you? (i.e. LS 2.0.0-rc1, ES vX.Y.Z, logstash-output-elasticsearch_java vX.Y.Z)

2) My hunch is around networking, since the error is about having no master: SERVICE_UNAVAILABLE/2/no master. Can you try reverting to the default port range (comment the custom range out; see the snippet after this list) in your Elasticsearch config across your whole cluster? Note: this is simply a suggestion to get another data point for investigation, not a permanent solution. I believe there's something buggy going on here, and knowing whether the default port range solves it will help us debug this.

3) Are you using Shield on this ES cluster?
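For (2), the block to comment out in each node's elasticsearch.yml is the transport profile from your config above:

    # transport.profiles.client:
    #   port: 9500-9600
    #   type: client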

ceeeekay commented 8 years ago

@jordansissel @andrewvc

While going back and reinstalling everything to build the matrix, I discovered that the issue seems to be caused by the license plugin (I'm running marvel-agent on dev).

I had to uninstall it to downgrade ES, and things just magically started working.

For the sake of completeness, I've tried the following, which all work fine:

    LS 2.0.0-rc1, ES 2.0.1, es_java 2.0.2
    LS 2.0.0-1,   ES 2.0.1, es_java 2.0.2
    LS 2.1.0-1,   ES 2.0.1, es_java 2.0.2
    LS 2.1.0-1,   ES 2.1.0, es_java 2.0.2

While running LS 2.1.0-1, ES 2.1.0, es_java 2.0.2 I simply installed the license plugin, without marvel-agent, and the problem reappeared.

Can you guys replicate that?
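For reference, all I did to reintroduce it was install the license plugin with the standard ES 2.x plugin manager (run from the Elasticsearch home directory on each node, then restart the node):

    bin/plugin install license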

Thanks for all your help so far.

andrewvc commented 8 years ago

@ceeeekay interesting, I believe that may fall under known behavior. This is one reason we're strongly considering removing node from the java output: it doesn't provide any speed benefit for Logstash, and it greatly increases configuration complexity.

FWIW we highly recommend using the standard HTTP output unless you've benchmarked and proven that transport is faster for your workload (and in that case we'd love to hear your story!). The performance difference between HTTP and transport/node is generally <3% (once you set workers > 1).
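For comparison, a sketch of the equivalent config on the standard HTTP output; the host and port are taken from your data node's http.port above, and which hosts you'd target (plus exact option support per plugin version) is an assumption on my part:

    output {
      elasticsearch {
        hosts => [ "dev-search1:9201" ]
        index => "%{[@metadata][index_prefix]}-%{client_id}-%{+YYYY.MM.dd}"
        manage_template => false
        template_overwrite => false
      }
    }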

andrewvc commented 8 years ago

@ceeeekay I'll need one more piece of info: is your license active or expired? And which plugins are enabled (marvel / shield / watcher)?

ceeeekay commented 8 years ago

@andrewvc the license is an active free Marvel license, but no plugins that require licenses are currently installed. I'll update on the reasons we use node shortly.

suyograo commented 8 years ago

@ceeeekay any reason why you can't switch to transport or http protocol?

suyograo commented 8 years ago

@ceeeekay For the node protocol to work with licensed plugins, you need to install an extra license plugin on Logstash. That's why we don't recommend using node with licensed plugins and prefer http or transport.

If you'd still like to use the node protocol:

To install the Logstash License plugin:

  1. Shut down the Logstash instance(s) that are shipping data to Elasticsearch.
  2. Run bin/plugin install to install the Logstash license plugin:

    bin/plugin install logstash-output-elasticsearch-license

ceeeekay commented 8 years ago

@suyograo excellent. I'll get this installed and tested soon, and I'll get back to you with my thinking on using node (I'm away from the office at the moment).

ceeeekay commented 8 years ago

@suyograo running LS 2.1.0, when I try bin/plugin install logstash-output-elasticsearch-license I get:

    ERROR: Installation Aborted, message: Bundler could not find compatible versions for gem "logstash-output-elasticsearch":
      In snapshot (Gemfile.lock):
        logstash-output-elasticsearch (= 2.1.4)

      In Gemfile:
        logstash-output-elasticsearch-license (>= 0) java depends on
          logstash-output-elasticsearch (~> 0.2) java

        logstash-output-elasticsearch (>= 0) java

This is with logstash-output-elasticsearch_java 2.0.2 already installed.

Any ideas? Is there another version I can install?

andrewvc commented 8 years ago

@ceeeekay we're working on delivering a fixed version. In the meantime, what's your reason for using node? We're genuinely curious about why people use the node protocol.

ceeeekay commented 8 years ago

@andrewvc @suyograo

Throughput per node is the least of our worries; we're not trying to stay with the node protocol for such a small speed gain.

The node protocol allows us to simplify our configuration by having the Logstash nodes connect to the masters to discover the rest of the cluster, which is also the discovery configuration we use for all our Elasticsearch nodes. The masters are the only nodes we can absolutely guarantee will be present; we treat the rest as somewhat expendable, since every other ES node type has at least one redundant node (more than one for data nodes).

Obviously we can't do this with HTTP, as it would put unnecessary load on the masters, which is not best practice.

We also don't want to do this with the transport protocol, for a similar reason. As I understand it, the transport connection would be two hops: first to the initial node, then to the node with the particular shard we want to index to. That means more unnecessary load on the masters if they're the first point of contact.

The only way to avoid load on the masters without requiring specific data nodes to be available would be to run a dedicated gateway node for HTTP or transport. In the interests of redundancy this would require two or more nodes just to allow the use of a different protocol, which seems wasteful. It also looks like a bottleneck, and not a great design for a scalable system.

Using node lets us add a theoretically unlimited number of Logstash nodes without the traffic passing through any sort of bottleneck, since Logstash connects directly to every node in the ES cluster. As we read from a queue upstream of the Logstash indexers, the load balances between indexer nodes quite naturally, without any configuration or planning on our part: if we max out our Logstash throughput, we simply add more nodes.

We are also running monitoring on our management node which allows us to easily see which Logstash nodes are connected to the cluster. We would lose this with transport or HTTP.

There are probably some other minor points here, but changing protocols would essentially require us to change our architecture to get similar results, and I feel we would lose some of the resiliency, scalability and simplicity we have at the moment.

I hope that covers most of the reasons why we have a strong preference to stay with the node protocol. I can clarify any of this if you like, either here or in #logstash (BaM`).

Cheers, Chris

andrewvc commented 8 years ago

@ceeeekay those are all good points! Thank you so much for being so patient with these issues and describing your very sensible reasoning behind using node.

The good news is that the transport protocol should do what you want! The transport docs, however, are incorrect. I've just opened a ticket on Elasticsearch core to fix them: https://github.com/elastic/elasticsearch/pull/15204

ceeeekay commented 8 years ago

@andrewvc thanks - that sounds great, but I have some transport questions:

Is two-hop an issue via data nodes, performance-wise (i.e., does it add any unnecessary load to the ES cluster)?

How does Logstash with transport react if a data node goes away while it's indexing? Does it update the cluster topology immediately, or switch to another surviving node, or does it drop messages until it rescans the cluster? Does it even rescan? I could test this myself but my dev stack only has a single data node.

I'd like to make this as robust as possible and not have to worry about what Logstash is doing if the ES cluster changes in any way.

The only other thing missing is that we would lose visibility of which Logstash nodes are connected and indexing if we switch to transport. It's not a huge deal, but this does make things easier to monitor and manage.

andrewvc commented 8 years ago

@ceeeekay performance-wise the impact is negligible. I ran some benchmarks a while back and couldn't find any difference. Even HTTP is only ~2-3% slower than the native protocols (which is why we made it the default).

andrewvc commented 8 years ago

@ceeeekay with regard to what happens if a data node goes away: it will switch to another node, though you may get an error for that request. That said, the Logstash plugin will issue a retry either way. I don't believe node is any more robust in this regard (you'd get an error there too, AFAIK).

We've had this discussion internally with members of Elasticsearch core and they much prefer the transport protocol over the node protocol.

WRT losing Logstash visibility, that is an interesting point! We hadn't considered that!

I'm going to close this ticket for now since the original issue has been resolved.

Thank you very much for going into so much detail, @ceeeekay. These are the tickets I really enjoy responding to!

ceeeekay commented 8 years ago

@andrewvc @suyograo @jordansissel Thank you all for your help with this; I really appreciate it.

@andrewvc Is there any ETA on a working logstash-output-elasticsearch-license? I'd still like to see how far I can get with it, mostly so we can monitor our Logstash nodes via the cluster.

andrewvc commented 8 years ago

@suyograo do you have an answer for @ceeeekay here wrt the license plugin?