hcguersoy closed this issue 8 years ago.
This is most likely a bug where the protocol doesn't get set because there are no servers specified.
Can you show the settings that you're using? I'm particularly curious whether or not you have left discovery.srv.servers
at the default.
Hi,
I set the discovery server (in this case a consul instance):
elasticsearch -Des.node.name=es-$node \
-Des.cluster.name=$CLUSTER_NAME \
-Des.network.host=0.0.0.0 \
-Des.index.number_of_shards=$AMOUNT_SHARDS \
-Des.index.number_of_replicas=$AMOUNT_REPLICAS \
-Des.discovery.zen.ping.multicast.enabled=false \
-Des.discovery.type=srv \
-Des.discovery.srv.query=elastic.service.consul \
-Des.discovery.srv.servers=${CONSUL_IP}:8600 \
-Des.discovery.srv.protocol=udp
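For reference, the same settings can also be placed in elasticsearch.yml instead of being passed as -Des.* flags (a sketch based on the setting names shown in the flags above; the Consul IP is a placeholder to fill in):

```yaml
# elasticsearch.yml equivalent of the -Des.* flags above (ES 1.x style)
discovery.type: srv
discovery.zen.ping.multicast.enabled: false
discovery.srv.query: elastic.service.consul
discovery.srv.servers: <CONSUL_IP>:8600
discovery.srv.protocol: udp
```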
The whole code is in https://github.com/hcguersoy/swarm-elastic-demo - it's a small demo in which a small Elasticsearch cluster is started.
When I ran your demo, the Consul VM was created successfully but it got stuck trying to create the swarm VM.
==> Creating a node for swarm master and starting it...
Creating VirtualBox VM...
Creating SSH key...
Starting VirtualBox VM...
Starting VM...
Error creating machine: Maximum number of retries (60) exceeded
Can you set the logging level to TRACE
and show the ES logs? Maybe those will give us a clue.
-Des.logger.discovery=TRACE
Hello Chris,
this is a strange error ... it sounds like the VM doesn't come up.
Here is some output from my tries with TCP, but even at TRACE level the discovery is not very verbose:
[2015-11-28 14:31:15,435][INFO ][node ] [es-local] version[1.7.3], pid[47002], build[05d4530/2015-10-15T09:14:17Z]
[2015-11-28 14:31:15,437][INFO ][node ] [es-local] initializing ...
[2015-11-28 14:31:15,599][INFO ][plugins ] [es-local] loaded [srv-discovery], sites []
[2015-11-28 14:31:15,674][INFO ][env ] [es-local] using [1] data paths, mounts [[/Volumes/HGU2 (/dev/disk3)]], net usable_space [15gb], net total_space [59.2gb], types [hfs]
[2015-11-28 14:31:18,275][DEBUG][discovery.zen.elect ] [es-local] using minimum_master_nodes [-1]
[2015-11-28 14:31:18,288][DEBUG][discovery.zen.ping.unicast] [es-local] using initial hosts [], with concurrent_connects [10]
[2015-11-28 14:31:18,289][DEBUG][discovery.srv ] [es-local] using ping.timeout [3s], join.timeout [1m], master_election.filter_client [true], master_election.filter_data [false]
[2015-11-28 14:31:18,292][DEBUG][discovery.zen.fd ] [es-local] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-11-28 14:31:18,295][DEBUG][discovery.zen.fd ] [es-local] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2015-11-28 14:31:19,112][INFO ][node ] [es-local] initialized
[2015-11-28 14:31:19,113][INFO ][node ] [es-local] starting ...
[2015-11-28 14:31:19,329][INFO ][transport ] [es-local] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.10.37:9300]}
[2015-11-28 14:31:19,346][INFO ][discovery ] [es-local] swarmones/DW9mS9zDRsKQihWFwkBGvQ
[2015-11-28 14:31:19,350][TRACE][discovery.srv ] [es-local] starting to ping
[2015-11-28 14:31:19,353][TRACE][discovery.srv ] [es-local] Building dynamic discovery nodes...
[2015-11-28 14:31:19,568][DEBUG][discovery.srv ] [es-local] No nodes found
[2015-11-28 14:31:19,568][DEBUG][discovery.srv ] [es-local] Using dynamic discovery nodes []
[2015-11-28 14:31:19,571][TRACE][discovery.zen.ping.unicast] [es-local] [1] connecting to [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]
[2015-11-28 14:31:19,789][TRACE][discovery.zen.ping.unicast] [es-local] [1] connected to [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]
[2015-11-28 14:31:19,789][TRACE][discovery.zen.ping.unicast] [es-local] [1] sending to [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]
[2015-11-28 14:31:19,851][TRACE][discovery.zen.ping.unicast] [es-local] [1] received response from [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]: [ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[1], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[2], master [null], hasJoinedOnce [false], cluster_name[swarmones]}]
[2015-11-28 14:31:21,074][TRACE][discovery.srv ] [es-local] Building dynamic discovery nodes...
[2015-11-28 14:31:21,129][DEBUG][discovery.srv ] [es-local] No nodes found
[2015-11-28 14:31:21,129][DEBUG][discovery.srv ] [es-local] Using dynamic discovery nodes []
[2015-11-28 14:31:21,129][TRACE][discovery.zen.ping.unicast] [es-local] [1] sending to [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]
[2015-11-28 14:31:21,131][TRACE][discovery.zen.ping.unicast] [es-local] [1] received response from [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]: [ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[1], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[3], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[4], master [null], hasJoinedOnce [false], cluster_name[swarmones]}]
[2015-11-28 14:31:22,634][TRACE][discovery.srv ] [es-local] Building dynamic discovery nodes...
[2015-11-28 14:31:22,688][DEBUG][discovery.srv ] [es-local] No nodes found
[2015-11-28 14:31:22,688][DEBUG][discovery.srv ] [es-local] Using dynamic discovery nodes []
[2015-11-28 14:31:22,689][TRACE][discovery.zen.ping.unicast] [es-local] [1] sending to [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]
[2015-11-28 14:31:22,692][TRACE][discovery.zen.ping.unicast] [es-local] [1] received response from [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]: [ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[1], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[3], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[5], master [null], hasJoinedOnce [false], cluster_name[swarmones]}, ping_response{node [[es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]]], id[6], master [null], hasJoinedOnce [false], cluster_name[swarmones]}]
[2015-11-28 14:31:22,693][TRACE][discovery.srv ] [es-local] full ping responses: {none}
[2015-11-28 14:31:22,693][DEBUG][discovery.srv ] [es-local] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2015-11-28 14:31:22,701][INFO ][cluster.service ] [es-local] new_master [es-local][DW9mS9zDRsKQihWFwkBGvQ][localhost][inet[/192.168.10.37:9300]], reason: zen-disco-join (elected_as_master)
[2015-11-28 14:31:22,705][TRACE][discovery.srv ] [es-local] cluster joins counter set to [1] (elected as master)
[2015-11-28 14:31:22,734][INFO ][http ] [es-local] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.10.37:9200]}
[2015-11-28 14:31:22,734][INFO ][node ] [es-local] started
[2015-11-28 14:31:22,752][INFO ][gateway ] [es-local] recovered [0] indices into cluster_state
[2015-11-28 14:32:11,465][INFO ][node ] [es-local] stopping ...
[2015-11-28 14:32:11,490][INFO ][node ] [es-local] stopped
[2015-11-28 14:32:11,490][INFO ][node ] [es-local] closing ...
[2015-11-28 14:32:11,509][INFO ][node ] [es-local] closed
I've updated the scripts a little bit to make the configuration more usable.
Hi Chris,
Here are some more findings (still with version 1.5.0; I have not yet tried your latest push).
If I use tcp as the protocol together with the mapped TCP port from the consul container (8653, mapped from 53/tcp), I observe the behavior described above.
But if I use tcp as the protocol together with the UDP port from the consul container (8600, mapped from 53/udp), it works - so UDP is still being used, as I wrote initially.
@hcguersoy I added some more log statements in #12 - could you try out that branch? There are some development instructions in case you're curious how to run it: https://github.com/github/elasticsearch-srv-discovery#development
Could you also double check that dig @<consul-IP> -p <consul-port> elastic.service.consul. SRV
works from inside your Elasticsearch container?
Hopefully these will narrow down the problem!
Hi, here are the dig outputs, run inside container es-1.
Using the UDP port:
root@d6553729f9db:/# dig @46.101.121.135 -p 8600 elastic.service.consul SRV
; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> @46.101.121.135 -p 8600 elastic.service.consul SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53721
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; QUESTION SECTION:
;elastic.service.consul. IN SRV
;; ANSWER SECTION:
elastic.service.consul. 0 IN SRV 1 1 9300 es-4.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-1.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-5.node.dc1.consul.
;; ADDITIONAL SECTION:
es-4.node.dc1.consul. 0 IN A 10.0.0.5
es-1.node.dc1.consul. 0 IN A 10.0.0.2
es-5.node.dc1.consul. 0 IN A 10.0.0.6
;; Query time: 2 msec
;; SERVER: 46.101.121.135#8600(46.101.121.135)
;; WHEN: Wed Dec 02 14:49:31 UTC 2015
;; MSG SIZE rcvd: 334
Using the TCP port and setting the TCP flag:
root@d6553729f9db:/# dig @46.101.121.135 -p 8653 elastic.service.consul +tcp SRV
; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> @46.101.121.135 -p 8653 elastic.service.consul +tcp SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17640
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 9, AUTHORITY: 0, ADDITIONAL: 9
;; QUESTION SECTION:
;elastic.service.consul. IN SRV
;; ANSWER SECTION:
elastic.service.consul. 0 IN SRV 1 1 9300 es-4.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-5.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-8.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-7.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-2.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-1.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-3.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-6.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-9.node.dc1.consul.
;; ADDITIONAL SECTION:
es-4.node.dc1.consul. 0 IN A 10.0.0.5
es-5.node.dc1.consul. 0 IN A 10.0.0.6
es-8.node.dc1.consul. 0 IN A 10.0.0.9
es-7.node.dc1.consul. 0 IN A 10.0.0.8
es-2.node.dc1.consul. 0 IN A 10.0.0.3
es-1.node.dc1.consul. 0 IN A 10.0.0.2
es-3.node.dc1.consul. 0 IN A 10.0.0.4
es-6.node.dc1.consul. 0 IN A 10.0.0.7
es-9.node.dc1.consul. 0 IN A 10.0.0.10
;; Query time: 3 msec
;; SERVER: 46.101.121.135#8653(46.101.121.135)
;; WHEN: Wed Dec 02 14:50:59 UTC 2015
;; MSG SIZE rcvd: 922
I'll try to test your branch this evening (CET).
That looks like the correct output to me (including the fact that UDP only returns 3 records), so the problem has got to be the plugin.
Hi @chrismwendt ,
here is the log running the plugin from PR #12:
I've run it on my local machine, pointing to the Consul Server:
./bin/elasticsearch -Des.node.name=es-local \
-Des.cluster.name=swarmones \
-Des.network.host=0.0.0.0 \
-Des.discovery.zen.ping.multicast.enabled=false \
-Des.discovery.type=srv \
-Des.discovery.srv.query=elastic.service.consul \
-Des.discovery.srv.servers=$(docker-machine ip consul):8653 \
-Des.discovery.srv.protocol=tcp \
-Des.logger.discovery=TRACE
Dig runs fine with TCP:
$ dig @$(docker-machine ip consul) -p 8653 elastic.service.consul +tcp SRV
; <<>> DiG 9.8.3-P1 <<>> @46.101.212.77 -p 8653 elastic.service.consul +tcp SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28413
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; QUESTION SECTION:
;elastic.service.consul. IN SRV
;; ANSWER SECTION:
elastic.service.consul. 0 IN SRV 1 1 9300 es-3.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-2.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-1.node.dc1.consul.
;; ADDITIONAL SECTION:
es-3.node.dc1.consul. 0 IN A 10.0.0.4
es-2.node.dc1.consul. 0 IN A 10.0.0.3
es-1.node.dc1.consul. 0 IN A 10.0.0.2
;; Query time: 867 msec
;; SERVER: 46.101.212.77#8653(46.101.212.77)
;; WHEN: Thu Dec 3 12:39:31 2015
;; MSG SIZE rcvd: 334
This is puzzling. I pushed a commit to the fix-default-tcp branch which logs the query as well. Could you try running it with that update, and also try setting the protocol to UDP and show the log for that too?
Also, try putting a . at the end of the query: elastic.service.consul.
Hi,
here are the new log outputs:
And with elastic.service.consul. as the query:
dig again runs fine:
dig @$(docker-machine ip consul) -p 8653 elastic.service.consul SRV +tcp
; <<>> DiG 9.8.3-P1 <<>> @46.101.213.107 -p 8653 elastic.service.consul SRV +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14172
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3
;; QUESTION SECTION:
;elastic.service.consul. IN SRV
;; ANSWER SECTION:
elastic.service.consul. 0 IN SRV 1 1 9300 es-3.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-2.node.dc1.consul.
elastic.service.consul. 0 IN SRV 1 1 9300 es-1.node.dc1.consul.
;; ADDITIONAL SECTION:
es-3.node.dc1.consul. 0 IN A 10.0.0.4
es-2.node.dc1.consul. 0 IN A 10.0.0.3
es-1.node.dc1.consul. 0 IN A 10.0.0.2
;; Query time: 91 msec
;; SERVER: 46.101.213.107#8653(46.101.213.107)
;; WHEN: Fri Dec 4 11:14:21 2015
;; MSG SIZE rcvd: 334
All of the settings are going through as expected, but for some reason the lookup returns nothing.
I have a few more ideas:
- watch the dig query using tcpdump
- check iptables -L in each container

I've added a piece of code to test this behavior. Now we simply have to find the differences between the code in the plugin and this code ;-) Call it simply with the IP/name of the DNS server (Consul?) and the port number.
Here's my run some minutes ago:
Record: elastic.service.consul. 0 IN SRV 1 1 9300 es-1.node.dc1.consul.
Record: elastic.service.consul. 0 IN SRV 1 1 9300 es-2.node.dc1.consul.
Record: elastic.service.consul. 0 IN SRV 1 1 9300 es-3.node.dc1.consul.
Name: elastic.service.consul.
Adress: es-1.node.dc1.consul.
Port: 9300
aRecord : es-1.node.dc1.consul. 0 IN A 10.0.0.2
Adress: 10.0.0.2
Name: elastic.service.consul.
Adress: es-2.node.dc1.consul.
Port: 9300
aRecord : es-2.node.dc1.consul. 0 IN A 10.0.0.3
Adress: 10.0.0.3
Name: elastic.service.consul.
Adress: es-3.node.dc1.consul.
Port: 9300
aRecord : es-3.node.dc1.consul. 0 IN A 10.0.0.4
Adress: 10.0.0.4
So the library is able to retrieve the data from the Consul node, Consul is functional, and there are no issues with the network.
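For context, a standalone SRV lookup over TCP with the dnsjava library (which the plugin is built on) can be sketched roughly like this. This is an illustration of the technique, not the exact test code from the demo repository; the default server IP and port are placeholders taken from this thread, and running it requires a reachable Consul DNS endpoint:

```java
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.SRVRecord;
import org.xbill.DNS.SimpleResolver;
import org.xbill.DNS.Type;

public class SrvLookupCheck {
    public static void main(String[] args) throws Exception {
        // DNS server (Consul) address and port, passed on the command line
        String server = args.length > 0 ? args[0] : "46.101.121.135"; // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8653;

        SimpleResolver resolver = new SimpleResolver(server);
        resolver.setPort(port);
        resolver.setTCP(true); // force the DNS query over TCP instead of UDP

        Lookup lookup = new Lookup("elastic.service.consul.", Type.SRV);
        lookup.setResolver(resolver);

        Record[] records = lookup.run();
        if (records == null) {
            System.out.println("No records found");
            return;
        }
        for (Record record : records) {
            SRVRecord srv = (SRVRecord) record;
            System.out.println("Record: " + srv);
            System.out.println("Target: " + srv.getTarget() + " Port: " + srv.getPort());
        }
    }
}
```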
The problem was right under our noses this whole time: the textbook == vs .equals string equality mistake.
It should be fixed in #13. @hcguersoy could you give that branch a try on your setup?
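For anyone unfamiliar with this bug class: in Java, == on two String values compares object references, while .equals compares character contents. An illustrative snippet (not the plugin's actual code) showing why a protocol string read at runtime never matches a literal under ==:

```java
public class StringEqualityPitfall {
    public static void main(String[] args) {
        // Simulates a setting value read from configuration at runtime,
        // so it is not the same interned object as the literal "tcp".
        String protocol = new String("tcp");

        System.out.println(protocol == "tcp");      // false: different object references
        System.out.println(protocol.equals("tcp")); // true: same character contents
    }
}
```

This is why a check like `protocol == "tcp"` silently evaluates to false for configured values, leaving the resolver on its UDP default regardless of the setting.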
Hello Chris,
it looks now fine:
[2015-12-06 17:44:52,366][INFO ][node ] [es-local] initialized
[2015-12-06 17:44:52,367][INFO ][node ] [es-local] starting ...
[2015-12-06 17:44:52,569][INFO ][transport ] [es-local] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.10.37:9300]}
[2015-12-06 17:44:52,581][INFO ][discovery ] [es-local] swarmones/KSaozijqShOxAfR-rr2Wrw
[2015-12-06 17:44:52,583][TRACE][discovery.srv ] [es-local] starting to ping
[2015-12-06 17:44:52,585][TRACE][discovery.srv ] [es-local] Building dynamic discovery nodes...
[2015-12-06 17:44:52,745][TRACE][discovery.srv ] [es-local] adding 10.0.0.4:9300, transport_address inet[/10.0.0.4:9300]
[2015-12-06 17:44:52,792][TRACE][discovery.srv ] [es-local] adding 10.0.0.2:9300, transport_address inet[/10.0.0.2:9300]
[2015-12-06 17:44:52,841][TRACE][discovery.srv ] [es-local] adding 10.0.0.3:9300, transport_address inet[/10.0.0.3:9300]
[2015-12-06 17:44:52,842][DEBUG][discovery.srv ] [es-local] Using dynamic discovery nodes [[#srv-10.0.0.4:9300-inet[/10.0.0.4:9300]][localhost][inet[/10.0.0.4:9300]], [#srv-10.0.0.2:9300-inet[/10.0.0.2:9300]][localhost][inet[/10.0.0.2:9300]], [#srv-10.0.0.3:9300-inet[/10.0.0.3:9300]][localhost][inet[/10.0.0.3:9300]]]
[2015-12-06 17:44:52,844][TRACE][discovery.zen.ping.unicast] [es-local] replacing [#srv-10.0.0.2:9300-inet[/10.0.0.2:9300]][localhost][inet[/10.0.0.2:9300]] with temp node [#zen_unicast_1_#srv-10.0.0.2:9300-inet[/10.0.0.2:9300]#][localhost][inet[/10.0.0.2:9300]]
[2015-12-06 17:44:52,845][TRACE][discovery.zen.ping.unicast] [es-local] replacing [#srv-10.0.0.3:9300-inet[/10.0.0.3:9300]][localhost][inet[/10.0.0.3:9300]] with temp node [#zen_unicast_2_#srv-10.0.0.3:9300-inet[/10.0.0.3:9300]#][localhost][inet[/10.0.0.3:9300]]
[2015-12-06 17:44:52,845][TRACE][discovery.zen.ping.unicast] [es-local] [1] connecting (light) to [#zen_unicast_1_#srv-10.0.0.2:9300-inet[/10.0.0.2:9300]#][localhost][inet[/10.0.0.2:9300]]
[2015-12-06 17:44:52,846][TRACE][discovery.zen.ping.unicast] [es-local] replacing [#srv-10.0.0.4:9300-inet[/10.0.0.4:9300]][localhost][inet[/10.0.0.4:9300]] with temp node [#zen_unicast_3_#srv-10.0.0.4:9300-inet[/10.0.0.4:9300]#][localhost][inet[/10.0.0.4:9300]]
[2015-12-06 17:44:52,846][TRACE][discovery.zen.ping.unicast] [es-local] [1] connecting (light) to [#zen_unicast_2_#srv-10.0.0.3:9300-inet[/10.0.0.3:9300]#][localhost][inet[/10.0.0.3:9300]]
[2015-12-06 17:44:52,846][TRACE][discovery.zen.ping.unicast] [es-local] [1] connecting (light) to [#zen_unicast_3_#srv-10.0.0.4:9300-inet[/10.0.0.4:9300]#][localhost][inet[/10.0.0.4:9300]]
It now finds the other nodes using TCP :-)
I finally managed to get your demo working (the problem was an outdated version of Docker/VirtualBox, and I think eth1 must be advertised), and confirmed that it works and that more than 3 records are returned :sparkles:
Thanks for reporting this and helping track down the problem :smiley:
@hcguersoy This has been fixed in https://github.com/github/elasticsearch-srv-discovery/releases/tag/1.5.1 :sparkles:
:+1:
Hi,
I've just played around with it - thanks for this nice plugin.
I've run into only one issue, regarding the protocol. In the class org.elasticsearch.discovery.srv.SrvUnicastHostsProvider you set the protocol, defaulting to tcp. I was not able to build up a cluster using TCP, so I debugged and observed that inside the ExtendedResolver the embedded SimpleResolver.useTCP() always returns false, regardless of the configured protocol. Using UDP, it works fine.