Hi,
I'm using etcd-mesos-0.1.0-alpha-target-23-24-25 with Apache Mesos 1.0.1 / Apache Marathon 1.1.0 on CentOS 7.2 (x86_64).
The etcd-mesos framework, running on one of the mesos-slaves, had started the configured number of etcd processes on other mesos-slaves. There were also three mesos-slaves in the system without etcd processes.
Test 1: I killed the local etcd process from the Linux command line (kill -9).
Test 2: I terminated one of the mesos-slaves running etcd processes.
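For reference, Test 1 was done roughly like this on the slave hosting the etcd task (a sketch; the grep pattern is an assumption and the PID is left as a placeholder):

ps aux | grep '[e]tcd'   # find the PID of the locally running etcd process
kill -9 <PID>            # simulate an unclean crash of that etcd member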
After both tests the framework became unhealthy and did not recover. After that I removed the whole framework from Marathon/Mesos with Ansible:
- name: Delete etcd-server App from Marathon
  local_action: uri url="http://{{ ansible_ssh_host }}:8080/v2/apps//etcd-server" method=DELETE
  ignore_errors: yes
  when: is_leader == True and etcd_status == "200" and clean_install | bool
and installed it again, but the same fault was still present.
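For reference, the Ansible task above is equivalent to this manual call against the Marathon REST API (the Marathon host/port is an assumption based on my setup):

curl -X DELETE "http://<marathon-host>:8080/v2/apps//etcd-server"   # delete the /etcd-server app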
I also noticed that after deleting the app from Marathon/Mesos and starting it again, the same etcd processes keep running, so the framework doesn't restart them. Should it?
Please see the framework's Docker log (the same output was seen after both test cases):
W0907 09:23:19.344302 7 scheduler.go:595] Scheduler not yet in sync with master.
I0907 09:23:19.345025 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.345061 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.349749 7 scheduler.go:314] Status update: task etcd-1473238196 mesos-slave-6 31000 31001 31002 is in state TASK_RUNNING
I0907 09:23:19.349773 7 zk.go:140] persisting reconciliation info to zookeeper
I0907 09:23:19.373683 7 scheduler.go:314] Status update: task etcd-1473238197 mesos-slave-2 31000 31001 31002 is in state TASK_LOST
E0907 09:23:19.373702 7 scheduler.go:333] Task contraction: TASK_LOST
E0907 09:23:19.373897 7 scheduler.go:334] message: Reconciliation: Task is unknown to the agent
E0907 09:23:19.373904 7 scheduler.go:335] reason: REASON_RECONCILIATION
I0907 09:23:20.347621 7 scheduler.go:584] Trying to sync with master.
I0907 09:23:20.347644 7 scheduler.go:592] Scheduler synchronized with master.
I0907 09:23:20.347650 7 scheduler.go:544] Scheduler transitioning to Mutable state.
I0907 09:23:21.158581 7 scheduler.go:847] skipping registration request: stopped=false, connected=true, authenticated=true
I0907 09:23:24.643545 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:25.642771 7 offercache.go:70] We already have enough offers cached.
W0907 09:23:29.379762 7 scheduler.go:694] Prune attempting to deconfigure unknown etcd instance: etcd-1473238197
I0907 09:23:29.379778 7 membership.go:235] Attempting to remove task etcd-1473238197 from the etcd cluster configuration.
I0907 09:23:32.651704 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:36.383689 7 membership.go:284] RemoveInstance response: {"message":"Internal Server Error"}
W0907 09:23:36.383733 7 membership.go:312] Failed to retrieve list of configured members. Backing off for 1 seconds and retrying.
I0907 09:23:37.659764 7 offercache.go:70] We already have enough offers cached.
I0907 09:23:42.662667 7 offercache.go:70] We already have enough offers cached.
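In case it helps, the member list that the scheduler fails to retrieve (membership.go:312) can also be checked by hand against the etcd v2 members API; a rough sketch (the etcd client host/port are assumptions):

curl "http://<etcd-client-host>:<client-port>/v2/members"   # list the members the etcd cluster itself currently knows about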