mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

etcd deployment fails with DCOS if framework found in Zookeeper #95

Open Radek44 opened 8 years ago

Radek44 commented 8 years ago

I set up etcd on my cluster using the DCOS CLI and it worked the first time. I then uninstalled it. A couple of days later I decided to reinstall, but since then every installation has failed. The reason seems to be that the framework is found in Zookeeper but fails to restore. Here is the failure trace from the stderr file in mesos (I only changed the IPs to x.x.x.x (agent) and y.y.y.y (mesos master)):

+ /work/bin/etcd-mesos-scheduler -alsologtostderr=true -framework-name=etcd -cluster-size=3 -master=zk://master.mesos:2181/mesos -zk-framework-persist=zk://master.mesos:2181/etcd -v=1 -auto-reseed=true -reseed-timeout=240 -sandbox-disk-limit=4096 -sandbox-cpu-limit=1 -sandbox-mem-limit=2048 -admin-port=3356 -driver-port=3357 -artifact-port=3358 -framework-weburi=http://etcd.marathon.mesos:3356/stats
I0222 04:14:30.573426       7 app.go:218] Found stored framework ID in Zookeeper, attempting to re-use: b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003
I0222 04:14:30.575267       7 scheduler.go:209] found failover_timeout = 168h0m0s
I0222 04:14:30.575363       7 scheduler.go:323] Initializing mesos scheduler driver
I0222 04:14:30.575473       7 scheduler.go:792] Starting the scheduler driver...
I0222 04:14:30.575552       7 http_transporter.go:407] listening on x.x.x.x port 3357
I0222 04:14:30.575588       7 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@10.32.0.4:3357
I0222 04:14:30.575625       7 scheduler.go:821] starting master detector *zoo.MasterDetector: &{client:<nil> leaderNode: bootstrapLock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} bootstrapFunc:0x7991c0 ignoreInstalled:0 minDetectorCyclePeriod:1000000000 done:0xc2080548a0 cancel:0x7991b0}
I0222 04:14:30.575746       7 scheduler.go:999] Scheduler driver running.  Waiting to be stopped.
I0222 04:14:30.575776       7 scheduler.go:663] running instances: 0 desired: 3 offers: 0
I0222 04:14:30.575799       7 scheduler.go:671] PeriodicLaunchRequestor skipping due to Immutable scheduler state.
I0222 04:14:30.575811       7 scheduler.go:1033] Admin HTTP interface Listening on port 3356
I0222 04:14:30.607180       7 scheduler.go:374] New master master@y.y.y.y:5050 detected
I0222 04:14:30.607306       7 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
I0222 04:14:30.607466       7 scheduler.go:922] Reregistering with master: master@172.16.0.7:5050
I0222 04:14:30.607656       7 scheduler.go:881] will retry registration in 1.254807398s if necessary
I0222 04:14:30.610527       7 scheduler.go:769] Handling framework error event.
I0222 04:14:30.610636       7 scheduler.go:1081] Aborting framework [&FrameworkID{Value:*b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003,XXX_unrecognized:[],}]
I0222 04:14:30.610890       7 scheduler.go:1062] stopping messenger
I0222 04:14:30.610985       7 messenger.go:269] stopping messenger..
I0222 04:14:30.611076       7 http_transporter.go:476] stopping HTTP transport
I0222 04:14:30.611168       7 scheduler.go:1065] Stop() complete with status DRIVER_ABORTED error <nil>
I0222 04:14:30.611262       7 scheduler.go:1051] Sending error via withScheduler: Framework has been removed
I0222 04:14:30.611366       7 scheduler.go:298] stopping scheduler event queue..
I0222 04:14:30.611504       7 http_transporter.go:450] HTTP server stopped because of shutdown
I0222 04:14:30.611598       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611687       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611779       7 scheduler.go:250] finished processing scheduler events

Any suggestions on how to fix the deployment?

jdef commented 8 years ago

yep, we need better uninstall instructions for etcd on DCOS.

go to <dcos-hostname>/exhibitor and view the node tree. you should see etcd as a child of the root. delete it, then try to re-install.
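The same cleanup can be done from a shell instead of the Exhibitor UI. A minimal sketch, assuming the ZooKeeper CLI (`zkCli.sh`) is available and that the framework state lives under `/etcd`, matching the `-zk-framework-persist=zk://master.mesos:2181/etcd` flag in the log above (host and node are assumptions, adjust for your cluster):

```shell
# Assumed values; change to match your cluster.
ZK_HOST="master.mesos:2181"
ZK_NODE="/etcd"   # child of the root holding the stale framework ID

# Inspect first, then delete the node recursively before re-installing:
#   zkCli.sh -server "$ZK_HOST" ls /
#   zkCli.sh -server "$ZK_HOST" rmr "$ZK_NODE"
echo "zkCli.sh -server $ZK_HOST rmr $ZK_NODE"
```

Deleting the node discards the stored framework ID, so the next install registers as a fresh framework instead of trying to re-use the removed one.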

jdef commented 8 years ago

#91

Radek44 commented 8 years ago

Brilliant. Thank you @jdef, this worked.

Radek44 commented 8 years ago

Quick addition: it looks like as soon as I try to scale out etcd using marathon (going from the default 1 instance to 3, as recommended), the deployment of the 3 instances fails for the same reason.

jdef commented 8 years ago

@spacejam is this supported? I was under the impression that cluster size should be determined at framework startup time, and only then.

On Mon, Feb 22, 2016 at 1:11 PM, Radek Dabrowski notifications@github.com wrote:

Quick addition - it looks like as soon as I try to scale out etcd using marathon (going from default 1 instance to 3 as recommended) the deployment of the 3 instances fails for the same reason.


spacejam commented 8 years ago

That's correct, @jdef. Marathon starts the etcd-mesos scheduler, not the etcd instances themselves; the instances are managed by the framework that marathon (or another higher-order supervisor) starts. Marathon shows 1 instance running because there is only 1 etcd-mesos framework running with a particular configuration. The number of etcd instances should be determined at initialization time, when submitting the app definition to marathon, for instance with the CLUSTER_SIZE env var in the provided example marathon spec:

{
  "id": "etcd",
  "container": {
    "docker": {
      "forcePullImage": true,
      "image": "mesosphere/etcd-mesos:0.1.0-alpha-target-23-24-25"
    },
    "type": "DOCKER"
  },
  "cpus": 0.2,
  "env": {
    "FRAMEWORK_NAME": "etcd",
    "WEBURI": "http://etcd.marathon.mesos:$PORT0/stats",
    "MESOS_MASTER": "zk://master.mesos:2181/mesos",
    "ZK_PERSIST": "zk://master.mesos:2181/etcd",
    "AUTO_RESEED": "true",
    "RESEED_TIMEOUT": "240",
    "CLUSTER_SIZE": "3",
    "CPU_LIMIT": "1",
    "DISK_LIMIT": "4096",
    "MEM_LIMIT": "2048",
    "VERBOSITY": "1"
  },
  "healthChecks": [
    {
      "gracePeriodSeconds": 60,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 0,
      "path": "/healthz",
      "portIndex": 0,
      "protocol": "HTTP"
    }
  ],
  "instances": 1,
  "mem": 128.0,
  "ports": [
    0,
    1,
    2
  ]
}
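Since the cluster size lives in the env block rather than in Marathon's "instances" field, resizing means editing the app definition and re-submitting it, not scaling in the Marathon UI. A hedged sketch of that edit (the Marathon URL in the comment is an assumption):

```python
import json

# The example app definition above, abbreviated to the fields that matter.
app = json.loads("""
{
  "id": "etcd",
  "instances": 1,
  "env": {
    "FRAMEWORK_NAME": "etcd",
    "CLUSTER_SIZE": "3"
  }
}
""")

# Resize the etcd cluster via the env var, NOT via "instances":
# "instances" counts etcd-mesos schedulers, not etcd members.
app["env"]["CLUSTER_SIZE"] = "5"

# Re-submit the edited definition, e.g. (URL is an assumption):
#   curl -X PUT http://marathon.mesos:8080/v2/apps/etcd -d @etcd-marathon.json
print(app["instances"], app["env"]["CLUSTER_SIZE"])
```

Bumping "instances" to 3, as described earlier in the thread, launches 3 competing schedulers with the same framework name, which reproduces the Zookeeper conflict from the original report.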

spacejam commented 8 years ago

actually, since you're using DCOS, you can specify the "cluster-size" configuration option for etcd to be something other than 3. 3 is the default and recommended size; a cluster of 5 trades slower writes for faster reads and additional availability.
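The 3-vs-5 trade-off follows from etcd's quorum math: every write must be acknowledged by a majority of members. A quick sketch:

```python
def quorum(n: int) -> int:
    """Majority needed for a write to commit in an n-member cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster stays available."""
    return n - quorum(n)

for n in (1, 3, 5):
    print(f"size={n} quorum={quorum(n)} tolerates={tolerated_failures(n)}")
```

So 3 members need 2 acks per write and survive 1 failure; 5 members survive 2 failures and offer more replicas to read from, but each write must reach 3 members, which is why writes slow down as the cluster grows.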