Closed - ceeeekay closed this issue 8 years ago
I can confirm this is still occurring with Logstash 2.1.0-1 and logstash-output-elasticsearch_java 2.0.2
The logstash node connects to the cluster and is visible in the node list, but reports [SERVICE_UNAVAILABLE/2/no master].
@ceeeekay with what version of elasticsearch? Can you attach your Logstash and Elasticsearch server configs?
@jordansissel currently ES 2.0.0 - looking to upgrade today.
LS:
```
output {
  elasticsearch_java {
    hosts => [ "dev-master1", "dev-master2", "dev-master3" ]
    node_name => "Logstash 1"
    network_host => "10.2.15.51"
    cluster => "dev"
    index => "%{[@metadata][index_prefix]}-%{client_id}-%{+YYYY.MM.dd}"
    protocol => "node"
    template_overwrite => "false"
    manage_template => "false"
  }
}
```
ES - data node:
```yaml
cluster.name: dev
node.name: dev-search1
node.master: false
node.data: true
index.number_of_replicas: 0
index.refresh_interval: 30s
path.repo: ["/var/lib/es-snapshot/kibana-index"]
network.bind_host: 0.0.0.0
network.publish_host: 10.2.15.61
transport.profiles.client:
  port: 9500-9600
  type: client
http.port: 9201
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 3
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dev-master1", "dev-master2", "dev-master3"]
bootstrap.mlockall: true
marvel.agent.enabled: true
marvel.agent.exporter.es.hosts: "localhost:9201"
index.translog.flush_threshold_ops: 50000
indices.recovery.max_bytes_per_sec: "100mb"
action.disable_delete_all_indices: true
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100
```
network_host => "10.2.15.51"
Is this setting correct? Can all Elasticsearch nodes talk to Logstash on this IP? Are the ports for Elasticsearch's node client open on the Logstash server (firewall, etc.)?
network_host => "10.2.15.51"
Can the Logstash server reach Elasticsearch on this IP? Are the ports open from Logstash to Elasticsearch? For all nodes?
I'm guessing this is a networking or network configuration issue for now.
This is the correct IP for the LS node. All the dev hosts have a single interface on the same /24, with no firewalling or routing. This setting is only present so it matches our production configs, as the nodes have a separate VLAN for ES comms.
Note that simply downgrading LS to 2.0.0-rc1-1 and reinstalling the plugin resolves the problem, with no config changes.
Ok that's useful info. Thanks! Hopefully we can track this down and figure out what's causing it.
Thanks whack :)
Just upgraded everything to ES 2.1 - issue persists.
@ceeeekay @jordansissel confirmed with LS 2.0 release. I just rolled a beta gem to fix this.
This does currently work on master, but master is blocked on some bugs I found today (one of them, which will be fixed by #33 and #34, which I just submitted, is a blocker for master going out as a release).
@ceeeekay can you try editing the Gemfile that ships with Logstash? Find the line for logstash-output-elasticsearch_java and replace it with gem "logstash-output-elasticsearch_java", "2.1.1.beta1". Then run bin/plugin install.
That should fix it up. Please let us know if this works.
@andrewvc I can't get v2.1.1.beta1 of the plugin to install.
I've replaced the plugin entry in /opt/logstash/Gemfile with gem "logstash-output-elasticsearch_java", "2.1.1.beta1", however running bin/plugin install logstash-output-elasticsearch_java causes this entry to revert to just gem "logstash-output-elasticsearch_java", and 2.0.2 is installed.
Am I doing something wrong here?
My apologies @ceeeekay, you'll need to run bin/plugin install --no-verify for that to work!
Hi @andrewvc - I can't get this plugin installed no matter what I try.
The complete command I'm using is bin/plugin install --no-verify logstash-output-elasticsearch_java
I've tried installing it to a fresh Logstash installation, as well as over the top of the 2.0.2 plugin. The entry in the Gemfile is consistently reverted to the default, and I end up with 2.0.2 installed.
I've also tried bin/plugin update logstash-output-elasticsearch_java (with the updated Gemfile) with no success.
Sorry if this is basic stuff - I've been using logstash for quite a while now but I don't really understand the mechanics of the plugin installer.
@andrewvc Well, typically, I got it to install just after I replied - I didn't realise bin/plugin install --no-verify needs no arguments.
I'm still seeing the same errors but with a different format.
Here are the steps so far, from a clean Logstash install:
```
root@dev-index1:/opt/logstash# bin/plugin list --verbose logstash-output-elasticsearch_java
ERROR: No plugins found
root@dev-index1:/opt/logstash# tail -1 Gemfile
gem "logstash-output-elasticsearch_java", "2.1.1.beta1"
root@dev-index1:/opt/logstash# bin/plugin install --no-verify
Installing...
Installation successful
root@dev-index1:/opt/logstash# bin/plugin list --verbose logstash-output-elasticsearch_java
logstash-output-elasticsearch_java (2.1.1.beta1)
```
- Start Logstash
- Node is seen to join cluster
(on active master...)
```
root@dev-master3:~# curl localhost:9201/_cat/master
5jIGag1LToi6GHsxVh5qtw 10.2.15.73 10.2.15.73 dev-master3
root@dev-master3:~# curl -s localhost:9201/_cat/nodes | grep Logstash
10.2.15.51 10.2.15.51 7 91 0.13 c - Logstash 1
```
I note the Logstash node doesn't report a node ID. Is this significant?
(on Logstash node...)
Attempted to send a bulk request to Elasticsearch configured at '["dev-master1", "dev-master2", "dev-master3"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided? {:client_config=>{:port=>9300, :protocol=>"node", :client_settings=>{"cluster.name"=>"dev", "network.host"=>"10.2.15.51", "client.transport.sniff"=>false, "node.name"=>"Logstash 1"}, :hosts=>["dev-master1", "dev-master2", "dev-master3"]}, :error_message=>"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];...
ES cluster is green and responding to queries. No indexing is taking place at this point, and the LS error repeats periodically. No sign of any errors in the active master's Elasticsearch log.
@ceeeekay that is an interesting error. Are you certain that the cluster name 'dev' you're providing to logstash is 100% identical to that of your cluster?
I'm going to ask around today to see if I can find some alternate explanations.
@andrewvc the cluster name in the config is definitely correct. Switching back to transport protocol works fine with the same settings. I'd assume that if any of it was incorrect the Logstash node would not appear in the node list. Also note that all of this config was previously working with logstash 2.0-rc1.
@ceeeekay and is this cluster using Elasticsearch 2.0 or 2.1? The latest jar is built against Elasticsearch 2.1
@andrewvc the cluster was upgraded to 2.1 last week.
@ceeeekay can you confirm that your Elasticsearch nodes can talk to your Logstash instance at the ip provided in 'network_host' and that nothing is blocking that (e.g. a firewall).
@ceeeekay I noticed in your config that your Elasticsearch config sets the port range to be 9500-9600, but your logstash config does not specify ports at all.
Is that the actual config you're using? If that's the case, perhaps your logstash is connecting to a different elasticsearch instance on the default port range (9300-9399).
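For illustration only (this snippet is hypothetical, not from the thread): if the intent were for Logstash to target the custom client profile's range rather than the default 9300 shown in the error output (:port=>9300), the output block would need to say so explicitly - assuming the plugin accepts a port setting of this form:

```
output {
  elasticsearch_java {
    hosts => [ "dev-master1", "dev-master2", "dev-master3" ]
    # Hypothetical: point the client at the custom profile's range
    # instead of the default 9300; assumes the plugin exposes "port".
    port => "9500-9600"
    cluster => "dev"
    protocol => "node"
  }
}
```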
@andrewvc There's no firewall here. This is a dev stack and is on a single isolated VLAN with a /24. The port specification is only for transport clients as I wanted to move them away from the node traffic on 9300 as per https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html#_tcp_transport_profiles (note that pasting the config has killed the indents. The port directive is part of transport.profiles.client).
This has all previously been working, and is still working in prod. The only thing that's changed has been to upgrade Logstash and Elasticsearch on dev. Downgrading LS to 2.0-rc1 resolves the problem, as does switching to transport protocol.
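With the indentation restored (the paste above flattened it), the profile section of elasticsearch.yml being described reads:

```yaml
transport.profiles.client:
  port: 9500-9600
  type: client
```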
@ceeeekay can you send the full output of http://localhost:9200/_cat/nodes?
BTW, the logstash node does have a node_id, it just isn't returned by the _cat/nodes API (that just returns names, not ids).
Currently I'm not having great luck reproducing this.
By the way, @ceeeekay can you reproduce this locally against a single Elasticsearch box?
@andrewvc Do you want the node output while I'm attempting to use the node protocol? It's currently using transport.
@ceeeekay since this is so mysterious, why not just throw in both for good measure :)
Node:
```
$ curl localhost:9201/_cat/nodes
10.2.15.71 10.2.15.71  2 92 0.01 - m dev-master1
10.2.15.72 10.2.15.72  7 91 1.68 - m dev-master2
10.2.15.73 10.2.15.73  5 91 0.12 - * dev-master3
10.2.15.61 10.2.15.61 46 65 0.32 d - dev-search1
10.2.15.81 10.2.15.81  6 81 0.65 - - dev-query1
10.2.15.51 10.2.15.51  6 87 0.28 c - Logstash 1
10.2.15.11 10.2.15.11  6 96 2.83 - - dev-mgmt1
```
Transport:
```
$ curl localhost:9201/_cat/nodes
10.2.15.73 10.2.15.73  6 91 0.20 - * dev-master3
10.2.15.11 10.2.15.11  6 95 7.97 - - dev-mgmt1
10.2.15.72 10.2.15.72  1 92 1.24 - m dev-master2
10.2.15.81 10.2.15.81  8 81 0.21 - - dev-query1
10.2.15.71 10.2.15.71  3 92 0.11 - m dev-master1
10.2.15.61 10.2.15.61 44 67 0.21 d - dev-search1
```
I'll try reproducing it with a single ES node shortly.
By the way @ceeeekay can you paste the full error when using the node protocol? I noticed you truncated it in https://github.com/logstash-plugins/logstash-output-elasticsearch_java/issues/28#issuecomment-160017810
I'm wondering if there might be extra info there that's useful.
@andrewvc
Attempted to send a bulk request to Elasticsearch configured at '["dev-master1", "dev-master2", "dev-master3"]', but an error occurred and it failed! Are you sure you can reach elasticsearch from this machine using the configuration provided? {:client_config=>{:port=>9300, :protocol=>"node", :client_settings=>{"cluster.name"=>"dev", "network.host"=>"10.2.15.51", "client.transport.sniff"=>false, "node.name"=>"Logstash 1"}, :hosts=>["dev-master1", "dev-master2", "dev-master3"]}, :error_message=>"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];", :error_class=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:154)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:144)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:212)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:71)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:150)", "org.elasticsearch.action.support.ThreadedActionListener$2.doRun(org/elasticsearch/action/support/ThreadedActionListener.java:104)", "org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org/elasticsearch/common/util/concurrent/AbstractRunnable.java:37)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:error}
blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master]; {:class=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:154)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:144)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:212)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:71)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:150)", "org.elasticsearch.action.support.ThreadedActionListener$2.doRun(org/elasticsearch/action/support/ThreadedActionListener.java:104)", "org.elasticsearch.common.util.concurrent.AbstractRunnable.run(org/elasticsearch/common/util/concurrent/AbstractRunnable.java:37)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1145)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:615)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
@ceeeekay @andrewvc and I are having a tough time trying to reproduce this, and we're both unable to find a reason why this would occur for you. I think we need more data:
1) Can you give us a quick matrix of which combinations of LS, ES, and logstash-output-elasticsearch_java versions work and do not work for you? (i.e., LS 2.0.0-rc1, ES vX.Y.Z, logstash-output-elasticsearch_java vX.Y.Z)
2) My hunch is around networking since the error is about having no master: SERVICE_UNAVAILABLE/2/no master
-- Can you try making the port range the default values (comment it out) in your Elasticsearch config for your whole cluster? Note: This is simply a suggestion to see if we can get another data point for investigation and is not a suggestion for a permanent solution - I believe there's something buggy going on here, and knowing if the default port range solves it will help us debug this.
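Concretely, a sketch of the suggested test (assuming the profile block pasted earlier in the thread): commenting out the custom profile in each node's elasticsearch.yml lets the default transport port range apply again:

```yaml
# Temporarily disabled for debugging - revert to default transport ports:
# transport.profiles.client:
#   port: 9500-9600
#   type: client
```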
3) Are you using Shield on this ES cluster?
@jordansissel @andrewvc
While going back and reinstalling everything to make up the matrix I discovered that the issue seems to be caused by the license plugin (I'm running marvel-agent on dev).
I had to uninstall it to downgrade ES, and things just magically started working.
For the sake of completeness, I've tried the following, which all work fine:
- LS 2.0.0-rc1, ES 2.0.1, es_java 2.0.2
- LS 2.0.0-1, ES 2.0.1, es_java 2.0.2
- LS 2.1.0-1, ES 2.0.1, es_java 2.0.2
- LS 2.1.0-1, ES 2.1.0, es_java 2.0.2
While running LS 2.1.0-1, ES 2.1.0, es_java 2.0.2, I simply installed the license plugin, without marvel-agent, and the problem reappeared.
Can you guys replicate that?
Thanks for all your help so far.
@ceeeekay interesting, I believe that may fall under known behavior. This is one reason we're strongly considering removing node from the java output. It doesn't provide any speed benefits for Logstash and it exponentially increases the complexity of configuration.
FWIW we highly recommend using the standard HTTP output unless you've benchmarked and proven that transport is faster for your workload (and in that case we'd love to hear your story!). The performance difference between HTTP and transport/node is <3% generally (once you set workers to be > 1).
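As a rough sketch of that recommendation (hostnames, HTTP port, and index pattern carried over from the configs earlier in the thread; illustrative, not a verified drop-in replacement): the stock elasticsearch output speaks HTTP by default, so the equivalent output block would look something like:

```
output {
  elasticsearch {
    # HTTP port 9201 per the cluster config pasted above
    hosts => [ "dev-search1:9201" ]
    index => "%{[@metadata][index_prefix]}-%{client_id}-%{+YYYY.MM.dd}"
    # per the note above, performance converges once workers > 1
    workers => 2
  }
}
```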
@ceeeekay I'll need one more piece of info. Is your license active or expired? Which plugins are enabled (marvel / shield / watcher).
@andrewvc the license is an active free Marvel license, but no plugins that require licenses are currently installed. I'll update on the reasons we use node shortly.
@ceeeekay any reason why you can't switch to the transport or http protocol?
@ceeeekay For node protocol to work with licensed plugins, you need to install an extra license plugin on Logstash. That's why we don't recommend using node with licensed plugins and prefer http or transport.
If you'd still like to use the node protocol:
To install the Logstash License plugin:
bin/plugin install logstash-output-elasticsearch-license
@suyograo excellent. I'll get this installed and tested soon and get back to you guys my thinking on using node (I'm away from the office at the moment).
@suyograo running LS 2.1.0, trying bin/plugin install logstash-output-elasticsearch-license, I get:
```
ERROR: Installation Aborted, message: Bundler could not find compatible versions for gem "logstash-output-elasticsearch":
  In snapshot (Gemfile.lock):
    logstash-output-elasticsearch (= 2.1.4)

  In Gemfile:
    logstash-output-elasticsearch-license (>= 0) java depends on
      logstash-output-elasticsearch (~> 0.2) java

    logstash-output-elasticsearch (>= 0) java
```
This is with logstash-output-elasticsearch_java 2.0.2 already installed.
Any ideas? Is there another version I can install?
@ceeeekay we're working on delivering a fixed version. In the meantime, what's your reason for using Node? We're genuinely curious when people use the Node protocol.
@andrewvc @suyograo
Throughput per node is the least of our worries - we're not trying to stay with node protocol for such a small speed gain.
Node protocol allows us to simplify our configuration by having the Logstash nodes connect to the masters to discover the rest of the cluster, which is also the discovery configuration we use for all our Elasticsearch nodes. Of all the nodes, the masters are the only ones we can absolutely guarantee will be present, as we treat all other nodes as somewhat expendable, as all other ES node types have at least one redundant node (more than one for data nodes).
Obviously we can't do this with HTTP, as it would put unnecessary load on the masters, which is not best practice.
We also don't want to do this with transport protocol, for a similar reason. As I understand it, the transport connection would be a two-hop via the initial node, and then to the node with the particular shard we want to index to - so more unnecessary load on the masters if they're the first point of contact.
The only way to avoid load on the masters which doesn't require specific data nodes to be available would be to run a dedicated gateway node for HTTP or transport. In the interests of redundancy this would require two or more nodes, just to allow the use of a different protocol - which seems wasteful. This also seems like somewhat of a bottleneck, and not a great design for a scalable system.
Using node allows us to add theoretically unlimited Logstash nodes - without the traffic passing through any sort of bottleneck, as Logstash is connecting directly to every node in the ES cluster. As we are reading from a queue upstream of the Logstash indexers, we also find that this load balances between indexer nodes quite naturally, without the need for us to configure or plan anything, i.e., if we max out our Logstash throughput, we simply add more nodes.
We are also running monitoring on our management node which allows us to easily see which Logstash nodes are connected to the cluster. We would lose this with transport or HTTP.
There are probably some other minor points here, but changing protocols would essentially require us to change our architecture to get similar results, and I feel we would lose some of the resiliency, scalability and simplicity we have at the moment.
I hope that covers most of the reasons why we have a strong preference to stay with the node protocol. I can clarify any of this if you like, either here or in #logstash (BaM`).
Cheers, Chris
@ceeeekay those are all good points! Thank you so much for being so patient with these issues and describing your very sensible reasoning behind using node.
The good news is that the transport protocol should do what you want! The transport docs are however incorrect! I've just opened a new ticket on Elasticsearch core to fix the docs: https://github.com/elastic/elasticsearch/pull/15204
@andrewvc thanks - that sounds great, but I have some transport questions:
Is two-hop an issue via data nodes, performance-wise (i.e., does it add any unnecessary load to the ES cluster)?
How does Logstash with transport react if a data node goes away while it's indexing? Does it update the cluster topology immediately, or switch to another surviving node, or does it drop messages until it rescans the cluster? Does it even rescan? I could test this myself but my dev stack only has a single data node.
I'd like to make this as robust as possible and not have to worry about what Logstash is doing if the ES cluster changes in any way.
The only other thing missing is that we would lose visibility of which Logstash nodes are connected and indexing if we switch to transport. It's not a huge deal, but this does make things easier to monitor and manage.
@ceeeekay performance-wise the impact is negligible. I ran some benchmarks a while back and couldn't find any difference. Even HTTP is only ~2-3% slower than the native protocols (which is why we made it the default).
@ceeeekay with regard to what happens if a data node goes away it will switch to another node. However, you may get an error for that request. That being said, the logstash plugin will issue a retry either way. I don't believe node is any more robust in this regard (you'd get an error there too AFAIK).
We've had this discussion internally with members of Elasticsearch core and they much prefer the transport protocol over the node protocol.
WRT losing Logstash visibility, that is an interesting point! We hadn't considered that!
I'm going to close this ticket for now since the original issue has been resolved.
Thank you very much for going into so much detail @ceeeekay, these are the tickets that I really enjoy responding to!
@andrewvc @suyograo @jordansissel Thank you all for your help with this - I really appreciate it.
@andrewvc Is there any ETA on a working logstash-output-elasticsearch-license? I'd still like to see how far I can get with it - mostly so we can monitor our Logstash nodes via the cluster.
@suyograo do you have an answer for @ceeeekay here wrt the license plugin?
Hi there,
I have a previously working config from Logstash 2.0.0-rc1-1 which was successfully indexing to Elasticsearch using the node protocol with logstash-output-elasticsearch_java.
After upgrading to Logstash 2.0.0-1 it can no longer connect to ES using the node protocol.
Logstash reports:
There are no errors in the logs on the ES master node, or issues with network connectivity. The ES cluster is green, and responding at all times.
If I change the protocol to "transport", Logstash connects and indexes, however this upsets our monitoring as the Logstash nodes no longer exist in the cluster.
Downgrading Logstash to 2.0.0-rc1-1 immediately resolves the problem, with no other changes required.
Can someone please confirm that node protocol works correctly with Logstash 2.0.0-1?