Consul broken ? - Githubissues

nsteinmetz commented 8 years ago

Hi,

I set up my PicoCluster and Cluster-Lab from sd-card-rpi-v0.5.14.img and then upgraded it.

As noted in #36, I saw that Docker was no longer working, so I removed the /etc/docker/daemon.json file. Docker and Cluster-Lab now start fine, but the Consul container keeps restarting.

Here is what I see on my master node:

HypriotOS/armv7: pirate@pico-master in ~
$ docker ps
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS                         PORTS                    NAMES
c0398b11ece3        hypriot/rpi-swarm:1.2.2    "/swarm manage --repl"   9 seconds ago       Up 7 seconds                   0.0.0.0:2378->2375/tcp   cluster_lab_swarmmanage
7495b4163adb        hypriot/rpi-swarm:1.2.2    "/swarm join --advert"   11 seconds ago      Up 9 seconds                   2375/tcp                 cluster_lab_swarm
82457f97e74d        hypriot/rpi-consul:0.6.4   "/consul agent -serve"   15 seconds ago      Restarting (1) 2 seconds ago                            cluster_lab_consul

$ docker logs cluster_lab_consul
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Joining cluster...
==> dial tcp 192.168.200.1:8301: getsockopt: connection refused
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Joining cluster...
==> dial tcp 192.168.200.1:8301: getsockopt: connection refused

and:

$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled)
   Active: active (running) since Fri 2016-05-27 21:34:19 UTC; 55s ago
     Docs: https://docs.docker.com
 Main PID: 1116 (docker)
   CGroup: /system.slice/docker.service
           ├─1116 /usr/bin/docker daemon --storage-driver overlay --host fd:// --debug --host tcp://192.168.200.31:2375 --cluster-advertise 192.168.200.31:2375 --cluster-sto...
           ├─1121 docker-containerd -l /var/run/docker/libcontainerd/docker-containerd.sock --runtime docker-runc --debug --metrics-interval=0
           ├─1395 docker-containerd-shim 7495b4163adb8d323bfb41671212d75aef65d04ca5264519aa90f4dbd0f91e12 /var/run/docker/libcontainerd/7495b4163adb8d323bfb41671212d75aef65d...
           ├─1476 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 2378 -container-ip 172.17.0.3 -container-port 2375
           └─1480 docker-containerd-shim c0398b11ece30d3c24cc0c8c5ec1851302dff382b983472719dbecb0ba64036a /var/run/docker/libcontainerd/c0398b11ece30d3c24cc0c8c5ec1851302dff...

May 27 21:34:48 pico-master docker[1116]: time="2016-05-27T21:34:48.214344343Z" level=debug msg="logs: begin stream"
May 27 21:34:48 pico-master docker[1116]: time="2016-05-27T21:34:48.219792562Z" level=debug msg="logs: end stream"
May 27 21:34:53 pico-master docker[1116]: time="2016-05-27T21:34:53.723090296Z" level=debug msg="received containerd event: &types.Event{Type:\"start-container\",...x5748bd7d}"
May 27 21:34:53 pico-master docker[1116]: time="2016-05-27T21:34:53.726771710Z" level=debug msg="event unhandled: type:\"start-container\" id:\"82457f97e74d5251a6...464384893 "
May 27 21:34:53 pico-master docker[1116]: time="2016-05-27T21:34:53Z" level=debug msg="containerd: process exited" id=82457f97e74d5251a6f5b5a619f7bd61db00b1c5c92e...temPid=1784
May 27 21:34:53 pico-master docker[1116]: time="2016-05-27T21:34:53.924103578Z" level=debug msg="received containerd event: &types.Event{Type:\"exit\", Id:\"82457...x5748bd7d}"
May 27 21:34:58 pico-master docker[1116]: time="2016-05-27T21:34:58.647580399Z" level=warning msg="Registering as \"192.168.200.31:2375\" in discovery failed: can...n sessions"
May 27 21:34:58 pico-master docker[1116]: time="2016-05-27T21:34:58.683884447Z" level=error msg="discovery error: Get http://192.168.200.31:8500/v1/kv/docker/node...on refused"
May 27 21:34:58 pico-master docker[1116]: time="2016-05-27T21:34:58.684968644Z" level=error msg="discovery error: Put http://192.168.200.31:8500/v1/kv/docker/node...on refused"
May 27 21:34:58 pico-master docker[1116]: time="2016-05-27T21:34:58.686406219Z" level=error msg="discovery error: Unexpected watch error"
Hint: Some lines were ellipsized, use -l to show in full.

$ sudo systemctl status cluster-lab -l
● cluster-lab.service - hypriot-cluster-lab
   Loaded: loaded (/etc/systemd/system/cluster-lab.service; enabled)
   Active: active (exited) since Fri 2016-05-27 21:34:30 UTC; 12min ago
 Main PID: 888 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cluster-lab.service
           └─975 dhclient eth0.200

May 27 21:33:36 pico-master cluster-lab[327]: dpkg-query: error: error writing to '<standard output>': Broken pipe
May 27 21:33:46 pico-master cluster-lab[327]: Device "eth0.200" does not exist.
May 27 21:33:46 pico-master cluster-lab[327]: dpkg-query: error: error writing to '<standard output>': Broken pipe
May 27 21:33:47 pico-master cluster-lab[327]: dpkg-query: error: error writing to '<standard output>': Broken pipe
May 27 21:33:49 pico-master cluster-lab[888]: dpkg-query: error: error writing to '<standard output>': Broken pipe
May 27 21:33:49 pico-master cluster-lab[888]: dpkg-query: error: error writing to '<standard output>': Broken pipe
May 27 21:33:53 pico-master dhclient[965]: DHCPREQUEST on eth0.200 to 255.255.255.255 port 67
May 27 21:33:53 pico-master dhclient[965]: DHCPACK from 192.168.200.1
May 27 21:34:30 pico-master systemd[1]: Started hypriot-cluster-lab.

What else do you need? How can I fix it?

Thanks, Nicolas

firecyberice commented 8 years ago

Please paste the output of cluster-lab health. Does eth0.200 exist?
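
One way to answer the second question is to look the interface up directly. The sketch below is not part of Cluster-Lab. It assumes a Linux host, where every network interface has an entry under /sys/class/net, and the names iface_exists, report_iface, SYSFS_NET and IFACE are made up for illustration.

```shell
#!/bin/sh
# Sketch: check whether a network interface exists by looking it up in sysfs.
# SYSFS_NET is overridable only so the functions can be tried without real
# hardware; on a Linux host the default below is the real location.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

iface_exists() {
    # $1: interface name, e.g. eth0.200
    [ -e "$SYSFS_NET/$1" ]
}

report_iface() {
    if iface_exists "$1"; then
        echo "$1 exists"
    else
        echo "$1 is missing"
    fi
}

# Check the VLAN interface discussed in this thread unless IFACE says otherwise.
report_iface "${IFACE:-eth0.200}"
```

Running `ip link show eth0.200` gives the same answer interactively, along with the interface state.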

nsteinmetz commented 8 years ago

Hi,

$ sudo cluster-lab health

Internet Connection
  [PASS]   eth0 exists
  [PASS]   eth0 has an ip address
  [PASS]   Internet is reachable
  [PASS]   DNS works

Networking
  [FAIL]   eth0.200 exists
  [FAIL]   eth0.200 has correct IP from vlan network
  [FAIL]   Cluster leader is reachable
  [FAIL]   eth0.200 has exactly one IP
  [PASS]   eth0.200 has no local link address
  [PASS]   Avahi process exists
  [FAIL]   Avahi is using eth0.200
Cannot find device "eth0.200"
  [FAIL]   Avahi cluster-leader.service file exists
Cannot find device "eth0.200"

DNSmasq
  [PASS]   dnsmasq process exists
  [FAIL]   /etc/dnsmasq.conf backup file exists

Docker
  [PASS]   Docker is running
  [FAIL]   Docker is configured to use Consul as key-value store
  [FAIL]   Docker is configured to listen via tcp at port 2375
  [FAIL]   Docker listens on  via tcp at port 2375 (Docker-Engine)

Consul
  [PASS]   Consul Docker image exists
  [FAIL]   Consul Docker container is running
  [FAIL]   Consul is listening on port 8300
  [FAIL]   Consul is listening on port 8301
  [FAIL]   Consul is listening on port 8302
  [FAIL]   Consul is listening on port 8400
  [FAIL]   Consul is listening on port 8500
  [FAIL]   Consul is listening on port 8600
  [FAIL]   Consul API works
  [PASS]   No Cluster-Node is in status 'failed'
  [FAIL]   Consul is able to talk to Docker-Engine on port 7946 (Serf)

Swarm
  [FAIL]   Swarm-Join Docker container is running
  [FAIL]   Swarm-Manage Docker container is running
  [PASS]   Number of Swarm and Consul nodes is equal which means our cluster is healthy

According to the output of ifconfig, eth0.200 no longer exists :(
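
With this many checks, it can help to reduce the report to the failing ones. The following is a minimal sketch, assuming the health command keeps printing one "[PASS]" or "[FAIL]" marker per check as in the output above. The function names list_failures and count_failures are hypothetical and not part of Cluster-Lab.

```shell
#!/bin/sh
# Sketch: summarise `cluster-lab health` output read from stdin.
# Assumes each check line looks like "  [FAIL]   <description>".

list_failures() {
    # Print only the descriptions of the failed checks.
    sed -n 's/^[[:space:]]*\[FAIL\][[:space:]]*//p'
}

count_failures() {
    # grep -c exits non-zero when nothing matches, so ignore its status.
    grep -F -c '[FAIL]' || true
}

# Demo on three lines copied from the report above.
sample='  [PASS]   eth0 exists
  [FAIL]   eth0.200 exists
  [FAIL]   Cluster leader is reachable'
printf '%s\n' "$sample" | list_failures
printf '%s\n' "$sample" | count_failures
```

On a node, the same functions could be fed with `sudo cluster-lab health | list_failures`. Lines such as `Cannot find device "eth0.200"` carry no marker and are skipped.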

Thanks, Nicolas

nsteinmetz commented 8 years ago

If I run cluster-lab stop and then cluster-lab start, eth0.200 exists again, but Consul still fails to start.

My master is now 192.168.200.31, but Consul tries to connect to 192.168.200.1, which now seems to be my "worker 4".
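
A mismatch like this can be spotted by pulling the dialled address out of the Consul logs and comparing it with the node's own address. This is only a sketch: it assumes the error lines keep the "dial tcp ADDRESS:PORT: ..." shape quoted earlier in this thread, and join_targets is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: extract the address(es) Consul failed to dial from log text on stdin.
# Assumes error lines shaped like the ones quoted in this thread:
#   ==> dial tcp 192.168.200.1:8301: getsockopt: connection refused

join_targets() {
    # Keep only ADDRESS:PORT, then drop the duplicates caused by restarts.
    sed -n 's/^.*dial tcp \([0-9.]*:[0-9]*\):.*$/\1/p' | sort -u
}

# Demo on the log excerpt from the first comment.
printf '%s\n' \
    '==> Joining cluster...' \
    '==> dial tcp 192.168.200.1:8301: getsockopt: connection refused' \
    '==> Joining cluster...' \
    '==> dial tcp 192.168.200.1:8301: getsockopt: connection refused' \
    | join_targets
```

On the node itself, the input could come from `docker logs cluster_lab_consul 2>&1 | join_targets`, and the result can be compared with the address shown by `ip -4 addr show eth0.200`.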

I stopped cluster-lab on all nodes, then started it again, beginning with the master node. From that point on it worked well.

Is there any difference between systemctl start|stop cluster-lab and cluster-lab start|stop?

Govinda-Fichtner commented 8 years ago

The important thing is that you first start what will become your leader node... It will announce its presence via Avahi. After one or two minutes you can start the follower nodes which then should join the leader to form a cluster.
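
The order described above (stop everywhere, start the leader, wait, then start the followers) could be scripted roughly as follows. This is a sketch under assumptions, not a Cluster-Lab tool: the follower hostnames are hypothetical, SSH access to every node is assumed, and the 120-second wait is simply the upper end of the "one or two minutes" mentioned above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: restart a Cluster-Lab cluster in leader-first order.
# Hostnames are placeholders; pico-node1 and pico-node2 are hypothetical.
LEADER="${LEADER:-pico-master}"
FOLLOWERS="${FOLLOWERS:-pico-node1 pico-node2}"
WAIT_SECONDS="${WAIT_SECONDS:-120}"
DRY_RUN="${DRY_RUN:-1}"   # set to 0 to really run the commands over SSH

run_on() {
    # $1: host, remaining arguments: command to run there
    host="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh $host $*"
    else
        ssh "$host" "$@"
    fi
}

pause() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "sleep $1"
    else
        sleep "$1"
    fi
}

restart_cluster() {
    # 1. Stop Cluster-Lab on every node so their configuration is reset.
    for host in $LEADER $FOLLOWERS; do
        run_on "$host" sudo cluster-lab stop
    done
    # 2. Start the future leader first and give it time to announce itself.
    run_on "$LEADER" sudo cluster-lab start
    pause "$WAIT_SECONDS"
    # 3. Start the followers, which should then join the leader.
    for host in $FOLLOWERS; do
        run_on "$host" sudo cluster-lab start
    done
}

restart_cluster
```

Keeping the dry run as the default makes it easy to review the sequence before pointing it at real nodes.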

Stopping the Cluster-Lab on all nodes was the right thing to do as it resets the configuration on all nodes.

nsteinmetz commented 8 years ago

Hmm, thanks. In fact, my issue was caused by a power outage at home, so all nodes restarted when the power came back.

I thought the two-minute delay applied only to the first run, and that there was no such delay on the second and later runs. I understand better now, thanks!

I did try to restart it with systemctl stop cluster-lab and systemctl start cluster-lab (admittedly without shutting down all nodes), but I seem to have hit both the Docker issue (cf. #36) and the missing eth0.200 interface.

Govinda-Fichtner commented 8 years ago

Seems this issue is solved.

nsteinmetz commented 8 years ago

Yes indeed, I forgot to close it. I was only expecting an answer on the difference between systemctl stop/start cluster-lab and cluster-lab stop/start.

hypriot / cluster-lab

Consul broken ? #42