mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

etcd-mesos HA fails #104

Closed jamietti closed 8 years ago

jamietti commented 8 years ago

Hi, I'm using etcd-mesos-0.1.0-alpha-target-23-24-25 with Apache Mesos 1.0.1 / Apache Marathon 1.1.0 on top of CentOS 7.2 on the x86_64 architecture.

The etcd-mesos framework, running on one of the mesos-slaves, had started the configured number of etcd processes on other mesos-slaves. There were also three mesos-slaves in the system without etcd processes. I then ran two tests:

Test 1: I killed a local etcd process from the Linux command line (kill -9 ).
Test 2: I terminated one of the mesos-slaves running etcd processes.
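After each test, one way to see whether the framework relaunched a replacement is to count the etcd tasks the Mesos master reports as running. Below is a minimal Go sketch of such a check; the master address (mesos-master:5050) and the framework name ("etcd") are assumptions and may differ in your cluster.

```go
// check_etcd_tasks.go - a rough diagnostic sketch, not part of etcd-mesos.
// It asks the Mesos master for its state and counts how many tasks of the
// etcd framework are in TASK_RUNNING.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type masterState struct {
	Frameworks []struct {
		Name  string `json:"name"`
		Tasks []struct {
			Name  string `json:"name"`
			State string `json:"state"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

func main() {
	// Assumed master address; adjust for your cluster.
	resp, err := http.Get("http://mesos-master:5050/master/state")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var state masterState
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatal(err)
	}

	running := 0
	for _, fw := range state.Frameworks {
		if fw.Name != "etcd" { // framework name as registered; adjust if different
			continue
		}
		for _, t := range fw.Tasks {
			if t.State == "TASK_RUNNING" {
				running++
			}
		}
	}
	fmt.Printf("etcd tasks in TASK_RUNNING: %d\n", running)
}
```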

After both tests the framework became unhealthy and did not recover. After that I removed the whole framework from Marathon/Mesos with Ansible:

and installed it again, but the same fault was still present.
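For reference, a minimal Go sketch of what such a manual cleanup can look like over the HTTP APIs: removing the scheduler app from Marathon (DELETE /v2/apps/{appId}) and then tearing the framework down on the Mesos master (POST /master/teardown) so its tasks are actually killed rather than left running. The app id etcd-mesos, the framework id placeholder, and the host names are assumptions.

```go
// teardown_framework.go - a sketch of a manual cleanup, under the assumptions
// named above; it is not how the reporter's Ansible playbook works.
package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// 1) Remove the scheduler app from Marathon (DELETE /v2/apps/{appId}).
	req, err := http.NewRequest(http.MethodDelete,
		"http://marathon:8080/v2/apps/etcd-mesos", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("Marathon delete:", resp.Status)

	// 2) Tear down the framework on the Mesos master so its etcd tasks are
	//    killed instead of being left running on the slaves.
	form := url.Values{"frameworkId": {"<framework-id-from-the-mesos-ui>"}} // placeholder
	resp, err = http.Post("http://mesos-master:5050/master/teardown",
		"application/x-www-form-urlencoded", strings.NewReader(form.Encode()))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("Mesos teardown:", resp.Status)
}
```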

I also noticed that after deleting the instance from Marathon/Mesos and starting it again, the same etcd processes keep running, so the framework doesn't restart them. Should it?

Please see the framework's Docker log (the same output was seen after both test cases):

W0907 09:23:19.344302 7 scheduler.go:595] Scheduler not yet in sync with master.
I0907 09:23:19.345025 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.345061 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.349749 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.349773 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.373683 7 scheduler.go:314] Status update: task etcd-1473238197 mesos-slave-2 31000 31001 31002 is in state TASK_LOST
E0907 09:23:19.373702 7 scheduler.go:333] Task contraction: TASK_LOST
E0907 09:23:19.373897 7 scheduler.go:334] message: Reconciliation: Task is unknown to the agent
E0907 09:23:19.373904 7 scheduler.go:335] reason: REASON_RECONCILIATION
I0907 09:23:20.347621 7 scheduler.go:584] Trying to sync with master.
I0907 09:23:20.347644 7 scheduler.go:592] Scheduler synchronized with master.
I0907 09:23:20.347650 7 scheduler.go:544] Scheduler transitioning to Mutable state.
I0907 09:23:21.158581 7 scheduler.go:847] skipping registration request: stopped=false, connected=true, authenticated=true
I0907 09:23:24.643545 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:25.642771 7 offercache.go:70] We already have enough offers cached.
W0907 09:23:29.379762 7 scheduler.go:694] Prune attempting to deconfigure unknown etcd instance: etcd-1473238197
I0907 09:23:29.379778 7 membership.go:235] Attempting to remove task etcd-1473238197 from the etcd cluster configuration.
I0907 09:23:32.651704 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:36.383689 7 membership.go:284] RemoveInstance response: {"message":"Internal Server Error"}
W0907 09:23:36.383733 7 membership.go:312] Failed to retrieve list of configured members. Backing off for 1 seconds and retrying.
I0907 09:23:37.659764 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:42.662667 7 offercache.go:70] We already have enough offers cached.
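The log shows RemoveInstance failing with an Internal Server Error and the scheduler then being unable to retrieve the configured member list. A small Go sketch that queries the etcd v2 members API directly can show what the running cluster itself believes its membership is; the host and client port (mesos-slave-6:31000, taken from the task line above) are assumptions, and any healthy etcd task's client endpoint can be used instead.

```go
// list_members.go - queries one etcd member's v2 members API so its view of
// the cluster can be compared with the task names the scheduler reports.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type membersResponse struct {
	Members []struct {
		ID         string   `json:"id"`
		Name       string   `json:"name"`
		PeerURLs   []string `json:"peerURLs"`
		ClientURLs []string `json:"clientURLs"`
	} `json:"members"`
}

func main() {
	// Assumed client endpoint of a healthy etcd task.
	resp, err := http.Get("http://mesos-slave-6:31000/v2/members")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var mr membersResponse
	if err := json.NewDecoder(resp.Body).Decode(&mr); err != nil {
		log.Fatal(err)
	}
	for _, m := range mr.Members {
		fmt.Printf("%s  %s  peers=%v clients=%v\n", m.ID, m.Name, m.PeerURLs, m.ClientURLs)
	}
}
```

Comparing that output with the task names in the log should make it clear whether etcd-1473238197 was ever actually removed from the cluster configuration.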

jamietti commented 8 years ago

This appeared to be a configuration error.