mantl / mesos-consul

Mesos to Consul bridge for service discovery
Apache License 2.0
338 stars 95 forks source link

Connect to Zookeeper fails #30

Open tnolet opened 8 years ago

tnolet commented 8 years ago

Hi,

getting this error:

2015/10/27 11:29:31 Connected to mesos01.bva.nu:2181
2015/10/27 11:29:31 Authenticated: id=94756170276864056, timeout=40000
2015/10/27 11:31:31 [ERROR] Timed out waiting for initial ZK detection

Seems like it can connect and then later times out. Any ideas?

ChrisAubuchon commented 8 years ago

The connect seems to be working. The timeout is caused by the zookeeper detector not finding a Mesos master in Zk. Can you log in to zookeeper and run ls /mesos and see if there are any json.info_000000 or info_000000 nodes?

tnolet commented 8 years ago

@ChrisAubuchon Sure, here you go. The info is right there. The mesos cluster itself is also performing fine, so no real Zookeeper issues on that side.

root@mesos01:~# /usr/share/zookeeper/bin/zkCli.sh 
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /mesos
[json.info_0000000025, json.info_0000000026, log_replicas]
[zk: localhost:2181(CONNECTED) 1] 
ChrisAubuchon commented 8 years ago

What version of mesos-consul are you running?

tnolet commented 8 years ago

@ChrisAubuchon I'm running it from this Docker container: ciscocloud/mesos-consul:v0.2.1

ChrisAubuchon commented 8 years ago

Okay. That's the issue. v0.2.1 doesn't support Mesos 0.24. There is a new container ciscocloud/mesos-consul:v0.3 that supports Mesos 0.24.

tnolet commented 8 years ago

@ChrisAubuchon I switched to v0.3. but get exactly the same. This is the log straight from the container:

root@mesos03:~# docker ps -a
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS                        PORTS               NAMES
be8acf83bbf8        ciscocloud/mesos-consul:v0.3     "/bin/mesos-consul --"   7 minutes ago       Exited (1) 5 minutes ago                          mesos-9d6e3ad9-5edc-42a4-bc2f-8035ece472a1-S0.268eb80b-1f51-43ea-b58d-985e99112237
root@mesos03:~# docker logs be8acf83bbf8
2015/10/27 14:23:10 Connected to mesos01.bva.nu:2181
2015/10/27 14:23:10 Authenticated: id=94756170276864062, timeout=40000
time="2015-10-27T14:25:10Z" level=fatal msg="Timed out waiting for initial ZK detection." 
root@mesos03:~# 
ChrisAubuchon commented 8 years ago

What is the zookeeper address that you are passing to the container? zk://mesos01.bva.nu:2181/mesos?

tnolet commented 8 years ago

@ChrisAubuchon Yes, but append with another zookeeper in the quorum. This is the exact app definition I deploy to Marathon. Of course, the hosts are resolvable as the connection is initially made if I interpret the logs correctly.

{
  "args": [
    "--zk=zk://mesos01.bva.nu:2181/mesos,zk://mesos02.bva.nu:2181/mesos"
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "network": "HOST",
      "image": "ciscocloud/mesos-consul:v0.3"
    }
  },
  "id": "mesos-consul",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256
}
ChrisAubuchon commented 8 years ago

Ah! mesos-consul doesn't currently support a list of zookeeper servers. Try taking out the second zookeeper server. I think it's trying to read the /mesos,zk://mesos02.bva.nu:2181/mesos node inside of zk.

tnolet commented 8 years ago

@ChrisAubuchon Aha! Ok, well that mean I can't use it at all in production. We need a Zookeeper quorum to keep the state of Mesos. Connecting to just one of the members for state kinda defeats the whole purpose of having Mesos, Zookeeper, Consul etc. What's your opinion?

tnolet commented 8 years ago

@ChrisAubuchon It works with just one ZK instance. Lets mark this as solved for now. I'll see if can find some time to contribute a patch for this. Also multi-port services is a thing we need...

ChrisAubuchon commented 8 years ago

@tnolet I think that the zookeeper detector CAN use multiple zookeeper servers but in a weird format. Try --zk=zk://mesos01.bva.nu:2181,mesos02.bva.nu:2181/mesos

tnolet commented 8 years ago

@ChrisAubuchon amazingly, that works. Although if one of the servers fails (I made a stupid typo) it doesn't try the other one...

ChrisAubuchon commented 8 years ago

@tnolet Is that at startup? I've tried an unreachable host, bad host name and a host not running zk and it is failing over correctly...

tnolet commented 8 years ago

@ChrisAubuchon yes, this was at startup. But the error was that the /mesos key was not there. This was my error. However, this should of course still trigger a failover to the other ZK instance. Not a big issue though.

cicide commented 8 years ago

I'm having this very same error, but the fix above to the --zk argument did not fix my issue.

[root@master1 ~]# mesos-consul --zk=zk://192.168.10.145:2181/mesos --log-level=DEBUG --consul --refresh=5s
INFO[0000] Using zookeeper: zk://192.168.10.145:2181/mesos
DEBU[0000] Zookeeper address                             zk=zk://192.168.10.145:2181/mesos
2016/04/11 20:17:26 Connected to 192.168.10.145:2181
2016/04/11 20:17:26 Authenticated: id=95710752094158869, timeout=40000
FATA[0120] Timed out waiting for initial ZK detection.

[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.

^]
telnet> close
Connection closed.
[root@master1 ~]#
[root@master1 ~]# mesos-consul --version
mesos-consul v0.3.1

[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.
ruok
imokConnection closed by foreign host.
[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.
status
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
 /192.168.10.149:46038[1](queued=0,recved=369,sent=369)
 /192.168.10.148:56186[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50194[1](queued=0,recved=222,sent=222)
 /192.168.10.146:53431[1](queued=0,recved=356,sent=356)
 /192.168.10.119:33931[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56184[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53429[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53432[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50038[1](queued=0,recved=2174,sent=2177)
 /192.168.10.145:50034[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56185[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50037[1](queued=0,recved=356,sent=356)
 /192.168.10.120:44014[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37536[1](queued=0,recved=356,sent=356)
 /192.168.10.121:49779[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53430[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46032[1](queued=0,recved=422,sent=424)
 /192.168.10.147:37533[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46034[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50035[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46036[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50036[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37531[1](queued=0,recved=711,sent=713)
 /192.168.10.149:46035[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56187[1](queued=0,recved=356,sent=356)
 /192.168.10.118:34270[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56182[1](queued=0,recved=709,sent=711)
 /192.168.10.115:42196[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37535[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37534[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46039[1](queued=0,recved=357,sent=357)
 /192.168.10.145:50445[0](queued=0,recved=1,sent=0)
 /192.168.10.149:46037[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50033[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53433[1](queued=0,recved=711,sent=713)
 /192.168.10.117:45342[1](queued=0,recved=356,sent=356)

Latency min/avg/max: 0/6/3725
Received: 17391
Sent: 17400
Connections: 36
Outstanding: 0
Zxid: 0x200000031
Mode: follower
Node count: 30
Connection closed by foreign host.
[root@master1 ~]#

[root@master1 ~]# consul version
Consul v0.6.4
Consul Protocol: 3 (Understands back to: 1)

[root@master1 ~]# mesos-master --version
mesos 0.28.0
ChrisAubuchon commented 8 years ago

@cicide Can you log into zookeeper and run ls /mesos?

cicide commented 8 years ago

Looks like you spotted it.

/opt/mesosphere/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /mesos
Node does not exist: /mesos

Strange, when setting up mesos and zookeeper, when is /mesos created? I would have expected to see some errors from mesos-master or mesos-slave.

cicide commented 8 years ago

@ChrisAubuchon

a bad mesos zk file was the problem. It caused an invalid name to be used in zookeeper for what should have been /mesos. Thanks for pointing me in the right direction

thomasvincent commented 7 years ago

need to document in wiki