itzg / docker-minecraft-server

Docker image that provides a Minecraft Server that will automatically download the selected version at startup
https://docker-minecraft-server.readthedocs.io/
Apache License 2.0

Elasticsearch: not enough master nodes discovered during pinging #165

Closed developius closed 7 years ago

developius commented 7 years ago

Hi,

I'm trying to get Elasticsearch running on my own two-node Docker Swarm and am running into a problem. I've followed your guide at https://hub.docker.com/r/itzg/elasticsearch/ using the sample docker-compose.yml and this command:

docker stack deploy -c docker-compose.yml es

When inspecting the tasks running, I get this in the error logs:

[2017-06-27T10:45:24,533][WARN ][o.e.d.z.ZenDiscovery     ] [GWlegCA] not enough master nodes discovered during pinging (found [[]], but needed [-1]), pinging again
[2017-06-27T10:45:24,534][WARN ][o.e.d.z.UnicastZenPing   ] [GWlegCA] failed to resolve host [master]

I'm also finding that the Kibana web UI is not loading at docker-host-ip:5601 - is this related?

I've never used ES before, but I know my way around swarm (I think!). Please could you give me a hand? Thanks!

itzg commented 7 years ago

Sorry for the delay on this. I'll first make sure it still works for me as expected -- perhaps I broke something Swarm-related along the way.

itzg commented 7 years ago

@developius ...it took a few minutes, but mine did start up properly. So, let's compare notes. My Docker version is

$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:15:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:15:36 2017
 OS/Arch:      linux/amd64
 Experimental: false

on Ubuntu 17.04 with kernel version 4.10.0-24-generic

Also of interest for comparison would be the container logs from my master instance:

Finding IPs. found! 10.0.0.3,172.25.0.3
Starting Elasticsearch with the options    -E path.conf=/conf   -E path.data=/data   -E path.logs=/data   -E transport.tcp.port=9300   -E http.port=9200 -E network.host=10.0.0.3,172.25.0.3 -E node.master=true -E node.data=false -E node.ingest=false -E discovery.zen.ping.unicast.hosts=master -E discovery.zen.minimum_master_nodes=1
Running as non-root...
[2017-06-29T03:57:09,128][INFO ][o.e.n.Node               ] [] initializing ...
[2017-06-29T03:57:09,926][INFO ][o.e.e.NodeEnvironment    ] [vQkhQ0w] using [1] data paths, mounts [[/data (/dev/mapper/ubuntu--vg-root)]], net usable_space [137.5gb], net total_space [226gb], spins? [possibly], types [ext4]
[2017-06-29T03:57:09,926][INFO ][o.e.e.NodeEnvironment    ] [vQkhQ0w] heap size [981.5mb], compressed ordinary object pointers [true]
[2017-06-29T03:57:09,932][INFO ][o.e.n.Node               ] node name [vQkhQ0w] derived from node ID [vQkhQ0waReqTczLIpq7pwA]; set [node.name] to override
[2017-06-29T03:57:09,933][INFO ][o.e.n.Node               ] version[5.4.2], pid[22], build[929b078/2017-06-15T02:29:28.122Z], OS[Linux/4.10.0-24-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_121/25.121-b13]
[2017-06-29T03:57:09,933][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -Des.path.home=/usr/share/elasticsearch-5.4.2]
[2017-06-29T03:57:25,600][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [aggs-matrix-stats]
[2017-06-29T03:57:25,601][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [ingest-common]
[2017-06-29T03:57:25,601][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [lang-expression]
[2017-06-29T03:57:25,602][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [lang-groovy]
[2017-06-29T03:57:25,602][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [lang-mustache]
[2017-06-29T03:57:25,603][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [lang-painless]
[2017-06-29T03:57:25,603][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [percolator]
[2017-06-29T03:57:25,603][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [reindex]
[2017-06-29T03:57:25,604][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [transport-netty3]
[2017-06-29T03:57:25,604][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] loaded module [transport-netty4]
[2017-06-29T03:57:25,607][INFO ][o.e.p.PluginsService     ] [vQkhQ0w] no plugins loaded
[2017-06-29T03:57:55,981][INFO ][o.e.d.DiscoveryModule    ] [vQkhQ0w] using discovery type [zen]
[2017-06-29T03:57:59,489][INFO ][o.e.n.Node               ] initialized
[2017-06-29T03:57:59,489][INFO ][o.e.n.Node               ] [vQkhQ0w] starting ...
[2017-06-29T03:57:59,759][INFO ][i.n.u.i.PlatformDependent] Your platform does not provide complete low-level API for accessing direct buffers reliably. Unless explicitly requested, heap buffer will always be preferred to avoid potential system instability.
[2017-06-29T03:58:00,211][INFO ][o.e.t.TransportService   ] [vQkhQ0w] publish_address {10.0.0.3:9300}, bound_addresses {10.0.0.3:9300}, {172.25.0.3:9300}
[2017-06-29T03:58:00,242][INFO ][o.e.b.BootstrapChecks    ] [vQkhQ0w] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-06-29T03:58:05,441][WARN ][o.e.d.z.UnicastZenPing   ] [vQkhQ0w] timed out after [5s] resolving host [master]
[2017-06-29T03:58:08,479][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] new_master {vQkhQ0w}{vQkhQ0waReqTczLIpq7pwA}{YNkTlQTJQRG-qaJMZGaqsQ}{10.0.0.3}{10.0.0.3:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-06-29T03:58:08,510][INFO ][o.e.h.n.Netty4HttpServerTransport] [vQkhQ0w] publish_address {10.0.0.3:9200}, bound_addresses {10.0.0.3:9200}, {172.25.0.3:9200}
[2017-06-29T03:58:08,515][INFO ][o.e.n.Node               ] [vQkhQ0w] started
[2017-06-29T03:58:08,578][INFO ][o.e.g.GatewayService     ] [vQkhQ0w] recovered [0] indices into cluster_state
[2017-06-29T03:59:36,068][WARN ][o.e.m.j.JvmGcMonitorService] [vQkhQ0w] [gc][young][78][3] duration [11.6s], collections [1]/[11.7s], total [11.6s]/[17.7s], memory [210mb]->[205.3mb]/[981.5mb], all_pools {[young] [184.1mb]->[17.7mb]/[256mb]}{[survivor] [0b]->[42.4mb]/[42.5mb]}{[old] [25.9mb]->[146mb]/[683mb]}
[2017-06-29T03:59:36,070][WARN ][o.e.m.j.JvmGcMonitorService] [vQkhQ0w] [gc][78] overhead, spent [11.6s] collecting in the last [11.7s]
[2017-06-29T03:59:36,833][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{fckZqQH}{fckZqQHqRL6vj8S5mAjiNQ}{F2NgTOUdQYmMjqATQNHYYg}{10.0.0.5}{10.0.0.5:9300},}, reason: zen-disco-node-join[{fckZqQH}{fckZqQHqRL6vj8S5mAjiNQ}{F2NgTOUdQYmMjqATQNHYYg}{10.0.0.5}{10.0.0.5:9300}]
[2017-06-29T03:59:49,349][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{lXP_bfK}{lXP_bfK8SE22FmgR88ylXQ}{ROycw9TQSJG-jyiledCXew}{10.0.0.6}{10.0.0.6:9300},}, reason: zen-disco-node-join[{lXP_bfK}{lXP_bfK8SE22FmgR88ylXQ}{ROycw9TQSJG-jyiledCXew}{10.0.0.6}{10.0.0.6:9300}]
[2017-06-29T04:00:00,538][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{Pv0920n}{Pv0920npSyyCQncjalSK8w}{IbxA-dtsQkubJAE9nI4OAg}{10.0.0.10}{10.0.0.10:9300},{k9JC0S4}{k9JC0S4CTVq-h90N5PACTg}{BP2OQr1KTK2acUSMIIV7bQ}{10.0.0.8}{10.0.0.8:9300},}, reason: zen-disco-node-join[{k9JC0S4}{k9JC0S4CTVq-h90N5PACTg}{BP2OQr1KTK2acUSMIIV7bQ}{10.0.0.8}{10.0.0.8:9300}, {Pv0920n}{Pv0920npSyyCQncjalSK8w}{IbxA-dtsQkubJAE9nI4OAg}{10.0.0.10}{10.0.0.10:9300}]
[2017-06-29T04:00:36,654][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] removed {{Pv0920n}{Pv0920npSyyCQncjalSK8w}{IbxA-dtsQkubJAE9nI4OAg}{10.0.0.10}{10.0.0.10:9300},{fckZqQH}{fckZqQHqRL6vj8S5mAjiNQ}{F2NgTOUdQYmMjqATQNHYYg}{10.0.0.5}{10.0.0.5:9300},{lXP_bfK}{lXP_bfK8SE22FmgR88ylXQ}{ROycw9TQSJG-jyiledCXew}{10.0.0.6}{10.0.0.6:9300},{k9JC0S4}{k9JC0S4CTVq-h90N5PACTg}{BP2OQr1KTK2acUSMIIV7bQ}{10.0.0.8}{10.0.0.8:9300},}, reason: zen-disco-node-failed({lXP_bfK}{lXP_bfK8SE22FmgR88ylXQ}{ROycw9TQSJG-jyiledCXew}{10.0.0.6}{10.0.0.6:9300}), reason(transport disconnected)[{lXP_bfK}{lXP_bfK8SE22FmgR88ylXQ}{ROycw9TQSJG-jyiledCXew}{10.0.0.6}{10.0.0.6:9300} transport disconnected], zen-disco-node-failed({Pv0920n}{Pv0920npSyyCQncjalSK8w}{IbxA-dtsQkubJAE9nI4OAg}{10.0.0.10}{10.0.0.10:9300}), reason(transport disconnected)[{Pv0920n}{Pv0920npSyyCQncjalSK8w}{IbxA-dtsQkubJAE9nI4OAg}{10.0.0.10}{10.0.0.10:9300} transport disconnected], zen-disco-node-failed({k9JC0S4}{k9JC0S4CTVq-h90N5PACTg}{BP2OQr1KTK2acUSMIIV7bQ}{10.0.0.8}{10.0.0.8:9300}), reason(transport disconnected)[{k9JC0S4}{k9JC0S4CTVq-h90N5PACTg}{BP2OQr1KTK2acUSMIIV7bQ}{10.0.0.8}{10.0.0.8:9300} transport disconnected], zen-disco-node-failed({fckZqQH}{fckZqQHqRL6vj8S5mAjiNQ}{F2NgTOUdQYmMjqATQNHYYg}{10.0.0.5}{10.0.0.5:9300}), reason(transport disconnected)[{fckZqQH}{fckZqQHqRL6vj8S5mAjiNQ}{F2NgTOUdQYmMjqATQNHYYg}{10.0.0.5}{10.0.0.5:9300} transport disconnected]
[2017-06-29T04:01:26,129][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{L5__TRc}{L5__TRcPTQyK0ZlCIjB3Rg}{hHCYeLOURViYdQlRX8FwYw}{10.0.0.6}{10.0.0.6:9300},}, reason: zen-disco-node-join[{L5__TRc}{L5__TRcPTQyK0ZlCIjB3Rg}{hHCYeLOURViYdQlRX8FwYw}{10.0.0.6}{10.0.0.6:9300}]
[2017-06-29T04:01:27,422][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{6pZRj4J}{6pZRj4JIS_OtQWY4j4CWuA}{N5VsjXqHS7-kkvONKXSwOg}{10.0.0.8}{10.0.0.8:9300},}, reason: zen-disco-node-join[{6pZRj4J}{6pZRj4JIS_OtQWY4j4CWuA}{N5VsjXqHS7-kkvONKXSwOg}{10.0.0.8}{10.0.0.8:9300}]
[2017-06-29T04:01:34,245][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{jdNUbcq}{jdNUbcqZS-elAC8sDSjKmw}{vXsUJ8zNQrSvPCEXwf7JNA}{10.0.0.5}{10.0.0.5:9300},}, reason: zen-disco-node-join[{jdNUbcq}{jdNUbcqZS-elAC8sDSjKmw}{vXsUJ8zNQrSvPCEXwf7JNA}{10.0.0.5}{10.0.0.5:9300}]
[2017-06-29T04:01:43,623][INFO ][o.e.c.s.ClusterService   ] [vQkhQ0w] added {{LknhmNI}{LknhmNIESxysd8wjHt9VWg}{YZyGgQH2QxO9uDB4eR2U1w}{10.0.0.10}{10.0.0.10:9300},}, reason: zen-disco-node-join[{LknhmNI}{LknhmNIESxysd8wjHt9VWg}{YZyGgQH2QxO9uDB4eR2U1w}{10.0.0.10}{10.0.0.10:9300}]
[2017-06-29T04:01:59,413][INFO ][o.e.c.m.MetaDataCreateIndexService] [vQkhQ0w] [.kibana] creating index, cause [api], templates [], shards [1]/[1], mappings [server, config]
[2017-06-29T04:02:01,023][INFO ][o.e.c.r.a.AllocationService] [vQkhQ0w] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana][0]] ...]).
itzg commented 7 years ago

...I just noticed I did have a few false starts of the task containers:

docker stack ps es
ID                  NAME                IMAGE                       NODE                DESIRED STATE       CURRENT STATE            ERROR                              PORTS
0lfcam10uwyw        es_gateway.1        itzg/elasticsearch:latest   zenbook             Running             Running 7 minutes ago                                       
maddv824lpyz        es_ingest.1         itzg/elasticsearch:latest   zenbook             Running             Running 7 minutes ago                                       
s8qridgtxuko        es_data.1           itzg/elasticsearch:latest   zenbook             Running             Running 7 minutes ago                                       
iatuj8c8j2mv        es_ingest.1         itzg/elasticsearch:latest   zenbook             Shutdown            Failed 8 minutes ago     "task: non-zero exit (137): do…"   
9m3smgb01mv7        es_gateway.1        itzg/elasticsearch:latest   zenbook             Shutdown            Failed 8 minutes ago     "task: non-zero exit (137): do…"   
9rh7u3rwan11        es_data.1           itzg/elasticsearch:latest   zenbook             Shutdown            Failed 8 minutes ago     "task: non-zero exit (137): do…"   
7po4yx56mym8        es_kibana.1         kibana:latest               zenbook             Running             Running 12 minutes ago                                      
23fvsrj3lkm9        es_ingest.1         itzg/elasticsearch:latest   zenbook             Shutdown            Failed 10 minutes ago    "task: non-zero exit (137): do…"   
weiiqw4yh1kv        es_gateway.1        itzg/elasticsearch:latest   zenbook             Shutdown            Failed 10 minutes ago    "task: non-zero exit (137): do…"   
w3rvxnuwebja        es_data.1           itzg/elasticsearch:latest   zenbook             Shutdown            Failed 10 minutes ago    "task: non-zero exit (137): do…"   
pyj218h001s0        es_master.1         itzg/elasticsearch:latest   zenbook             Running             Running 10 minutes ago                                      
zrr2oqbutvnc        es_data.2           itzg/elasticsearch:latest   zenbook             Running             Running 7 minutes ago                                       
vnee1y4u07rx         \_ es_data.2       itzg/elasticsearch:latest   zenbook             Shutdown            Failed 8 minutes ago     "task: non-zero exit (137): do…"   
pybwyz713haq         \_ es_data.2       itzg/elasticsearch:latest   zenbook             Shutdown            Failed 10 minutes ago    "task: non-zero exit (137): do…"   

For experimenting, you could instead try this minimal composition that I just pushed. It's not really making much use of Swarm, but it eliminates a lot of moving parts.
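(For reference, a minimal composition of that shape might look like the sketch below. The service names line up with the es_master/es_kibana tasks seen later in this thread, but the environment variable names are my assumptions based on the image's conventions, not a copy of the pushed gist.)

```yaml
version: '3'

services:
  master:
    image: itzg/elasticsearch:latest
    environment:
      # assumed variable name: tells the node where to find unicast peers
      UNICAST_HOSTS: master

  kibana:
    image: kibana:latest
    environment:
      # standard Kibana 5.x setting; resolves the master service by name
      ELASTICSEARCH_URL: http://master:9200
    ports:
      - "5601:5601"
```

Deployed the same way as before, e.g. `docker stack deploy -c docker-compose.yml es`.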

developius commented 7 years ago

Thanks for getting back to me 🙌

Here are the details:

root@docker-1:~# docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64
 Experimental: false
root@docker-1:~#
root@docker-1:~# uname -a
Linux docker-1 4.9.20-std-1 #1 SMP Tue Apr 4 12:56:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@docker-1:~#
root@docker-1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial
root@docker-1:~#

I just tried that config you posted, output below :/

root@docker-1:~# docker stack ps es
ID            NAME         IMAGE                      NODE      DESIRED STATE  CURRENT STATE          ERROR  PORTS
20l649fh70hm  es_kibana.1  kibana:latest              docker-2  Running        Running 2 minutes ago         
5g2l47givix3  es_master.1  itzg/elasticsearch:latest  docker-1  Running        Running 2 minutes ago
Finding IPs. found! 10.255.0.6,172.18.0.3,10.0.0.3
Starting Elasticsearch with the options    -E path.conf=/conf   -E path.data=/data   -E path.logs=/data   -E transport.tcp.port=9300   -E http.port=9200 -E network.host=10.255.0.6,172.18.0.3,10.0.0.3 -E discovery.zen.ping.unicast.hosts=master -E discovery.zen.minimum_master_nodes=1
Running as non-root...
[2017-06-29T18:15:02,398][INFO ][o.e.n.Node               ] [] initializing ...
[2017-06-29T18:15:02,746][INFO ][o.e.e.NodeEnvironment    ] [MidDlKN] using [1] data paths, mounts [[/data (/dev/vda)]], net usable_space [40.2gb], net total_space [45.7gb], spins? [possibly], types [ext4]
[2017-06-29T18:15:02,749][INFO ][o.e.e.NodeEnvironment    ] [MidDlKN] heap size [981.5mb], compressed ordinary object pointers [true]
[2017-06-29T18:15:02,756][INFO ][o.e.n.Node               ] node name [MidDlKN] derived from node ID [MidDlKN6QXOnHaPvHUFrdA]; set [node.name] to override
[2017-06-29T18:15:02,757][INFO ][o.e.n.Node               ] version[5.4.2], pid[20], build[929b078/2017-06-15T02:29:28.122Z], OS[Linux/4.9.20-std-1/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_121/25.121-b13]
[2017-06-29T18:15:02,759][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -Des.path.home=/usr/share/elasticsearch-5.4.2]
[2017-06-29T18:15:08,005][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [aggs-matrix-stats]
[2017-06-29T18:15:08,005][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [ingest-common]
[2017-06-29T18:15:08,009][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [lang-expression]
[2017-06-29T18:15:08,010][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [lang-groovy]
[2017-06-29T18:15:08,013][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [lang-mustache]
[2017-06-29T18:15:08,014][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [lang-painless]
[2017-06-29T18:15:08,015][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [percolator]
[2017-06-29T18:15:08,016][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [reindex]
[2017-06-29T18:15:08,017][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [transport-netty3]
[2017-06-29T18:15:08,017][INFO ][o.e.p.PluginsService     ] [MidDlKN] loaded module [transport-netty4]
[2017-06-29T18:15:08,022][INFO ][o.e.p.PluginsService     ] [MidDlKN] no plugins loaded
[2017-06-29T18:15:13,732][INFO ][o.e.d.DiscoveryModule    ] [MidDlKN] using discovery type [zen]
[2017-06-29T18:15:15,977][INFO ][o.e.n.Node               ] initialized
[2017-06-29T18:15:15,979][INFO ][o.e.n.Node               ] [MidDlKN] starting ...
[2017-06-29T18:15:16,118][INFO ][i.n.u.i.PlatformDependent] Your platform does not provide complete low-level API for accessing direct buffers reliably. Unless explicitly requested, heap buffer will always be preferred to avoid potential system instability.
[2017-06-29T18:15:16,602][INFO ][o.e.t.TransportService   ] [MidDlKN] publish_address {10.0.0.3:9300}, bound_addresses {172.18.0.3:9300}, {10.0.0.3:9300}, {10.255.0.6:9300}
[2017-06-29T18:15:16,635][INFO ][o.e.b.BootstrapChecks    ] [MidDlKN] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-06-29T18:15:16,777][WARN ][o.e.d.z.UnicastZenPing   ] [MidDlKN] failed to resolve host [master]
java.net.UnknownHostException: master: Name does not resolve
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_121]
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) ~[?:1.8.0_121]
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) ~[?:1.8.0_121]
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276) ~[?:1.8.0_121]
    at java.net.InetAddress.getAllByName(InetAddress.java:1192) ~[?:1.8.0_121]
    at java.net.InetAddress.getAllByName(InetAddress.java:1126) ~[?:1.8.0_121]
    at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:922) ~[elasticsearch-5.4.2.jar:5.4.2]
    at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:877) ~[elasticsearch-5.4.2.jar:5.4.2]
    at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:674) ~[elasticsearch-5.4.2.jar:5.4.2]
    at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$null$0(UnicastZenPing.java:213) ~[elasticsearch-5.4.2.jar:5.4.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_121]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.2.jar:5.4.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-06-29T18:15:19,889][INFO ][o.e.c.s.ClusterService   ] [MidDlKN] new_master {MidDlKN}{MidDlKN6QXOnHaPvHUFrdA}{qBCxNqNmTw-tXfTNUZcMfA}{10.0.0.3}{10.0.0.3:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-06-29T18:15:20,045][INFO ][o.e.h.n.Netty4HttpServerTransport] [MidDlKN] publish_address {10.0.0.3:9200}, bound_addresses {172.18.0.3:9200}, {10.0.0.3:9200}, {10.255.0.6:9200}
[2017-06-29T18:15:20,059][INFO ][o.e.n.Node               ] [MidDlKN] started
[2017-06-29T18:15:20,137][INFO ][o.e.g.GatewayService     ] [MidDlKN] recovered [0] indices into cluster_state
developius commented 7 years ago

Just upgraded to Docker 17.06.0-ce and I'm still getting the same problem. The root of the issue seems to be the failed to resolve host [master] error. Oddly, I can exec into a master container and successfully ping master. The Kibana container is also complaining that it can't reach master (confirmed by pinging from an exec shell).

itzg commented 7 years ago

Hmm, your container is getting assigned a third 10.255.x.x IP address, but that might just be a coincidence. I need to get my multi-node cluster up and running again to confirm there isn't a subtle but important difference there.

itzg commented 7 years ago

@developius, sorry it took longer than I wanted to get my 3-node swarm going again. Well... the good news is I see the same "master: Name does not resolve" as you. Perhaps an additional, private overlay network is needed within the es stack/composition. I'll poke around.
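(A private overlay network could be declared in the composition along these lines. This is only a sketch: the network name is made up, and whether it actually fixes the resolution problem is an open question at this point.)

```yaml
version: '3'

services:
  master:
    image: itzg/elasticsearch:latest
    networks:
      - esnet   # attach to the dedicated network instead of the default one

  data:
    image: itzg/elasticsearch:latest
    networks:
      - esnet

networks:
  esnet:
    driver: overlay   # swarm-scoped network so service names resolve across nodes
```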

developius commented 7 years ago

Awesome to hear that it's not just me, thanks!

itzg commented 7 years ago

...even though I see that, the Kibana service did start and find the master ES node successfully. I'm also adding an ES data node per swarm node using this compose file:

https://gist.github.com/itzg/a185d87e4e1a888b9bdd45b7aa55ce19#file-docker-compose-yml

Now my only challenge is squeezing these into 1GB VMs :)

itzg commented 7 years ago

To trim down memory usage (especially for memory-constrained test/demo scenarios), I pushed an update to the image that adds a NON_DATA node type. With that, this is the stack that is now working for me:

https://github.com/itzg/dockerfiles/blob/master/elasticsearch/docker-compose-3x1GB.yml

Showing:

ID                  NAME                                IMAGE                       NODE                DESIRED STATE       CURRENT STATE                ERROR               PORTS
82oj1p357rha        es_data.o3d9426rs2kpe9scco2mg3psm   itzg/elasticsearch:latest   rack2               Running             Running about a minute ago                       
mfv8qbg22ozk        es_data.n0afhmdvj1b5jyhuca4gt51ne   itzg/elasticsearch:latest   rack1               Running             Running about a minute ago                       
53z77f00ryti        es_data.lvfj3wx74d635ypixj3dtgw7g   itzg/elasticsearch:latest   rack3               Running             Running about a minute ago                       
x8gkfworxvee        es_kibana.1                         kibana:latest               rack1               Running             Running 2 minutes ago                            
5kxafikf23n4        es_master.1                         itzg/elasticsearch:latest   rack3               Running             Running about a minute ago                       
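(For anyone reading along, the node role in this image is selected per service via an environment variable. The fragment below is a hedged sketch: the TYPE/UNICAST_HOSTS names are assumptions based on the image's conventions, with NON_DATA being the newly added type; the global deploy mode matches the per-node es_data.&lt;node-id&gt; task names in the output above.)

```yaml
  data:
    image: itzg/elasticsearch:latest
    environment:
      TYPE: NON_DATA          # assumed value for the new lightweight node type
      UNICAST_HOSTS: master   # assumed: discover peers via the master service name
    deploy:
      mode: global            # one data task per swarm node
```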
developius commented 7 years ago

I just tried your config on two VMs, and I'm getting this error in the data containers:

not enough master nodes discovered during pinging (found [[]], but needed [-1]), pinging again

Kibana is not starting up either, with this error:

Unable to revive connection: http://master:9200

And finally, the master container:

failed to resolve host [master]
java.net.UnknownHostException: master: Name does not resolve
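(A side note on the found [[]], but needed [-1] message: if I understand zen discovery correctly, -1 is the default for discovery.zen.minimum_master_nodes when the setting never gets applied, which fits the host list failing to resolve. When it is set, the safe value is a majority of the master-eligible nodes; a quick sketch of the arithmetic:)

```shell
# Rule of thumb for zen discovery: with N master-eligible nodes,
# discovery.zen.minimum_master_nodes should be a majority, (N / 2) + 1.
# Seeing "-1" in the log suggests the setting was never applied at all.
n=3
echo "minimum_master_nodes for $n master-eligible nodes: $(( n / 2 + 1 ))"
```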
itzg commented 7 years ago

Strange, the overlay network name resolution is acting differently for you. As a sanity test, do these cross-pinging services resolve names correctly for you?

version: '3'

services:
  first:
    image: alpine:3.5
    command: sh -c "sleep 5 ; ping second"

  second:
    image: alpine:3.5
    command: sh -c "sleep 5 ; ping first"

  master:
    image: alpine:3.5
    command: sh -c "sleep 5 ; ping gateway"

  gateway:
    image: alpine:3.5
    command: sh -c "sleep 5 ; ping master"
developius commented 7 years ago

Yep, that works, although I'm intermittently getting the error below (no idea what to do about it). I suspect it's related; that node is the second one in the swarm.

$ docker service logs ping_gateway 
error from daemon in stream: Error grabbing logs: rpc error: code = 2 desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node z8noyw1ircju77fxmxn8tliue is not available
itzg commented 7 years ago

Thanks for checking. Hmm, it must be something induced by the way Elasticsearch does hostname resolution via Java. I'll do some more thinking.

developius commented 7 years ago

I came across this thread the other day while looking into another issue with one of my services, and it looks like this is the culprit: scaleway/image-ubuntu#78. Basically, some kernel modules are missing from the Ubuntu image on Scaleway VPSes (which is what I'm using), and they were causing problems with Swarm networking. After changing the bootscript to use the Rancher kernel, everything started working.

My apologies for a false alarm!

itzg commented 7 years ago

Excellent. Glad to hear there was a logical reason for it.