mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

etcd-mesos HA fails #104

Closed jamietti closed 8 years ago

jamietti commented 8 years ago

Hi, I'm using etcd-mesos-0.1.0-alpha-target-23-24-25 with Apache Mesos 1.0.1 / Apache Marathon 1.1.0 on top of CentOS 7.2 on the x86_64 architecture.

The etcd-mesos framework, running on one of the mesos-slaves, had started the configured number of etcd processes on other mesos-slaves. There were also three mesos-slaves in the system without etcd processes. I then ran two tests:

Test 1: I killed a local etcd process from the Linux command line (kill -9 ).
Test 2: I terminated one of the mesos-slaves running etcd processes.
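After each test, one way to see whether the framework relaunched a replacement is to count the etcd tasks the Mesos master reports as running. Below is a minimal Go sketch of such a check; the master address (mesos-master:5050) and the framework name ("etcd") are assumptions and may differ in your cluster.

```go
// check_etcd_tasks.go - a rough diagnostic sketch, not part of etcd-mesos.
// It asks the Mesos master for its state and counts how many tasks of the
// etcd framework are in TASK_RUNNING.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type masterState struct {
	Frameworks []struct {
		Name  string `json:"name"`
		Tasks []struct {
			Name  string `json:"name"`
			State string `json:"state"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

func main() {
	// Assumed master address; adjust for your cluster.
	resp, err := http.Get("http://mesos-master:5050/master/state")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var state masterState
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatal(err)
	}

	running := 0
	for _, fw := range state.Frameworks {
		if fw.Name != "etcd" { // framework name as registered; adjust if different
			continue
		}
		for _, t := range fw.Tasks {
			if t.State == "TASK_RUNNING" {
				running++
			}
		}
	}
	fmt.Printf("etcd tasks in TASK_RUNNING: %d\n", running)
}
```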

After both tests the framework became unhealthy and did not recover. After that I removed the whole framework from Marathon/Mesos with Ansible:

and installed it again, but the same fault was still present.
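For reference, a minimal Go sketch of what such a manual cleanup can look like over the HTTP APIs: removing the scheduler app from Marathon (DELETE /v2/apps/{appId}) and then tearing the framework down on the Mesos master (POST /master/teardown) so its tasks are actually killed rather than left running. The app id etcd-mesos, the framework id placeholder, and the host names are assumptions.

```go
// teardown_framework.go - a sketch of a manual cleanup, under the assumptions
// named above; it is not how the reporter's Ansible playbook works.
package main

import (
	"log"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// 1) Remove the scheduler app from Marathon (DELETE /v2/apps/{appId}).
	req, err := http.NewRequest(http.MethodDelete,
		"http://marathon:8080/v2/apps/etcd-mesos", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("Marathon delete:", resp.Status)

	// 2) Tear down the framework on the Mesos master so its etcd tasks are
	//    killed instead of being left running on the slaves.
	form := url.Values{"frameworkId": {"<framework-id-from-the-mesos-ui>"}} // placeholder
	resp, err = http.Post("http://mesos-master:5050/master/teardown",
		"application/x-www-form-urlencoded", strings.NewReader(form.Encode()))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Println("Mesos teardown:", resp.Status)
}
```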

I also noticed that after deleting the instance from Marathon/Mesos and starting it again, the same etcd processes keep running, so the framework doesn't restart them. Should it?

Please see the framework's Docker log (the same output was seen after both test cases):

W0907 09:23:19.344302 7 scheduler.go:595] Scheduler not yet in sync with master.
I0907 09:23:19.345025 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.345061 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.349749 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.349773 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.373683 7 scheduler.go:314] Status update: task etcd-1473238197 mesos-slave-2 31000 31001 31002 is in state TASK_LOST
E0907 09:23:19.373702 7 scheduler.go:333] Task contraction: TASK_LOST
E0907 09:23:19.373897 7 scheduler.go:334] message: Reconciliation: Task is unknown to the agent
E0907 09:23:19.373904 7 scheduler.go:335] reason: REASON_RECONCILIATION
I0907 09:23:20.347621 7 scheduler.go:584] Trying to sync with master.
I0907 09:23:20.347644 7 scheduler.go:592] Scheduler synchronized with master.
I0907 09:23:20.347650 7 scheduler.go:544] Scheduler transitioning to Mutable state.
I0907 09:23:21.158581 7 scheduler.go:847] skipping registration request: stopped=false, connected=true, authenticated=true
I0907 09:23:24.643545 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:25.642771 7 offercache.go:70] We already have enough offers cached.
W0907 09:23:29.379762 7 scheduler.go:694] Prune attempting to deconfigure unknown etcd instance: etcd-1473238197
I0907 09:23:29.379778 7 membership.go:235] Attempting to remove task etcd-1473238197 from the etcd cluster configuration.
I0907 09:23:32.651704 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:36.383689 7 membership.go:284] RemoveInstance response: {"message":"Internal Server Error"}
W0907 09:23:36.383733 7 membership.go:312] Failed to retrieve list of configured members. Backing off for 1 seconds and retrying.
I0907 09:23:37.659764 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:42.662667 7 offercache.go:70] We already have enough offers cached.
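The log shows RemoveInstance failing with an Internal Server Error and the scheduler then being unable to retrieve the configured member list. A small Go sketch that queries the etcd v2 members API directly can show what the running cluster itself believes its membership is; the host and client port (mesos-slave-6:31000, taken from the task line above) are assumptions, and any healthy etcd task's client endpoint can be used instead.

```go
// list_members.go - queries one etcd member's v2 members API so its view of
// the cluster can be compared with the task names the scheduler reports.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type membersResponse struct {
	Members []struct {
		ID         string   `json:"id"`
		Name       string   `json:"name"`
		PeerURLs   []string `json:"peerURLs"`
		ClientURLs []string `json:"clientURLs"`
	} `json:"members"`
}

func main() {
	// Assumed client endpoint of a healthy etcd task.
	resp, err := http.Get("http://mesos-slave-6:31000/v2/members")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var mr membersResponse
	if err := json.NewDecoder(resp.Body).Decode(&mr); err != nil {
		log.Fatal(err)
	}
	for _, m := range mr.Members {
		fmt.Printf("%s  %s  peers=%v clients=%v\n", m.ID, m.Name, m.PeerURLs, m.ClientURLs)
	}
}
```

Comparing that output with the task names in the log should make it clear whether etcd-1473238197 was ever actually removed from the cluster configuration.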

jamietti commented 8 years ago

This appeared to be a configuration error.