`spread cluster start` fails to start up cluster

hharnisc commented 8 years ago

It's unclear how I got in this state, but I'm not able to start up a localkube cluster.

I've tried stopping/starting the cluster, removing all images and containers, re-creating a docker machine, and even going as far as re-installing docker.

The container seems to continuously restart

CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS                         PORTS               NAMES
5a1e299c3124        redspreadapps/localkube:latest   "start.sh"          15 minutes ago      Restarting (0) 2 minutes ago                       localkube

When I grab the container logs (`docker logs 5a1e299c3124`), I get the following:

0bb5c03101f0f473218733b67258b04c07176225413651703e62295686adc014
1ef0f1618a97621cf0cca908d428cf466d3dc4b5f8ac4c1112d8829bb31dc147
10.32.0.1
Starting LocalKube...
Starting etcd...
2016-05-24 14:27:53.477939 I | etcdserver: recovered store from snapshot at index 460046
2016-05-24 14:27:53.478088 I | etcdserver: name = kubeetcd
2016-05-24 14:27:53.478126 I | etcdserver: data dir = /var/localkube/data
2016-05-24 14:27:53.478152 I | etcdserver: member dir = /var/localkube/data/member
2016-05-24 14:27:53.478175 I | etcdserver: heartbeat = 100ms
2016-05-24 14:27:53.478197 I | etcdserver: election = 1000ms
2016-05-24 14:27:53.478218 I | etcdserver: snapshot count = 10000
2016-05-24 14:27:53.478245 I | etcdserver: advertise client URLs = http://localhost:2379
2016-05-24 14:27:53.478289 I | etcdserver: loaded cluster information from store: <nil>
2016-05-24 14:27:54.295145 C | etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired
Plugin is not running.

mfburnett commented 8 years ago

Hey @hharnisc, try stopping localkube and removing all containers with `spread cluster stop -r`, then restart with `spread cluster start`. Let me know if that fixes it.

hharnisc commented 8 years ago

@mfburnett still no luck

$ spread cluster stop -r
Stopping container '5a1e299c3124b361b895f1279f612f1174f7c5e2e9b5287a8ae077b12708f803'
Removing container '5a1e299c3124b361b895f1279f612f1174f7c5e2e9b5287a8ae077b12708f803'

then starting it

$ spread cluster start
Creating localkube container...
Starting localkube container...

then checking the cluster

$ kubectl cluster-info
The connection to the server 192.168.99.100:8080 was refused - did you specify the right host or port?

hharnisc commented 8 years ago

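Looking at that log, it looks like etcd is having a bad time: it fails while reading its write-ahead log (`read wal error (walpb: crc mismatch) and cannot be repaired`). It's potentially blowing up here: https://github.com/coreos/etcd/blob/master/wal/wal.go#L271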
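For anyone hitting the same symptom, one quick way to confirm this failure mode is to search the container logs for the fatal etcd line. This is only a sketch: `has_wal_corruption` is a hypothetical helper name, not part of spread or localkube, and the usage comment assumes the container is named `localkube`, as in the `docker ps` output above.

```shell
# Hypothetical helper, not part of spread or localkube.
# Reads log text on stdin and exits 0 if the etcd WAL CRC error
# from the log above is present, 1 otherwise.
has_wal_corruption() {
  grep -qF 'read wal error (walpb: crc mismatch)'
}

# Possible usage (assumes the container is named "localkube"):
#   docker logs localkube 2>&1 | has_wal_corruption && echo "etcd WAL looks corrupted"
```

The match is a fixed string (`-F`) so the parentheses in the log message are not treated as pattern syntax.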

hharnisc commented 8 years ago

@mfburnett @ethernetdan does localkube cache anything on the host filesystem?

hharnisc commented 8 years ago

`rm -rf ~/.localkube` seems to have gotten me unstuck. I wish I had thought to keep a copy of the data in there so you could use it to debug. If it happens again, I'll be sure to include it.
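If someone else ends up here, moving the directory aside instead of deleting it keeps the data around for debugging. The following is a minimal sketch, not an official spread command: `backup_localkube_data` is a hypothetical helper, and the default path `~/.localkube` is taken from the comment above.

```shell
# Hypothetical helper, not part of spread or localkube.
# Moves the localkube data directory to a timestamped backup instead of
# deleting it, and prints the backup path on success.
backup_localkube_data() {
  data_dir="${1:-$HOME/.localkube}"
  if [ ! -d "$data_dir" ]; then
    echo "no data directory at $data_dir" >&2
    return 1
  fi
  backup_dir="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$data_dir" "$backup_dir" && echo "$backup_dir"
}

# Possible usage, between stopping and starting the cluster:
#   spread cluster stop -r
#   backup_localkube_data
#   spread cluster start
```

The printed backup path can then be archived and attached to the issue.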

mfburnett commented 8 years ago

@hharnisc hm glad you got unstuck, thanks for documenting it!

ibmendoza commented 8 years ago

It also happened to me under Turnkey Linux 14.1, but fortunately the commands below worked. Thanks @mfburnett

spread cluster stop -r

spread cluster start

redspread / localkube

`spread cluster start` fails to start up cluster #63