mesos / elasticsearch

Elasticsearch on Mesos
Apache License 2.0

Uninstall instructions #571

Open nandanrao opened 8 years ago

nandanrao commented 8 years ago

It's really nice and easy to launch an elasticsearch cluster (in my case, into a DC/OS cluster) with this library. However, it's a little unclear to me how to remove a cluster / uninstall. Could use some mention of this in the docs!

nandanrao commented 8 years ago

(if the only correct way is to manually /teardown via mesos framework id, i'm happy to make a docs PR, but I suspect maybe there's another way?)

frankscholten commented 8 years ago

@nandanrao Thanks for opening this issue.

Indeed, tearing down via

curl -XPOST $MASTER/teardown -d 'frameworkId=794b66f4-2c4f-45cd-920b-8ee0b3555259-0001'

is the way to do it.
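For anyone scripting this, the same call could be sketched in Python roughly as follows. This is only an illustration of the curl command above: the helper name `build_teardown_request` and the `http://localhost:5050` master address are assumptions, not part of the framework, and the request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request


def build_teardown_request(master_url: str, framework_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the master's /teardown endpoint."""
    # Same form-encoded body as: curl -XPOST $MASTER/teardown -d 'frameworkId=...'
    body = urllib.parse.urlencode({"frameworkId": framework_id}).encode("ascii")
    return urllib.request.Request(
        master_url.rstrip("/") + "/teardown", data=body, method="POST"
    )


if __name__ == "__main__":
    # Placeholder master address (assumption); replace with your own $MASTER.
    req = build_teardown_request(
        "http://localhost:5050", "794b66f4-2c4f-45cd-920b-8ee0b3555259-0001"
    )
    print(req.get_method(), req.full_url, req.data.decode("ascii"))
    # To actually tear the framework down: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending makes it easy to double-check the framework id before doing something destructive.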

However, when testing this I might have found a bug. I did a teardown, the scheduler and executors were killed as expected and then Marathon restarted the scheduler. But now it did not launch a new executor. Instead the logs contained the following:

[DEBUG] 2016-06-22 12:37:16,061 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "794b66f4-2c4f-45cd-920b-8ee0b3555259-O245" }, framework_id { value: "794b66f4-2c4f-45cd-920b-8ee0b3555259-0001" }, slave_id { value: "794b66f4-2c4f-45cd-920b-8ee0b3555259-S3" }, hostname: "172.17.0.8", resources { name: "ports",  type: RANGES,  ranges {  range {   begin: 37000,    end: 38000,   },  },  role: "*" }, resources { name: "cpus",  type: SCALAR,  scalar {  value: 2.0,  },  role: "*" }, resources { name: "mem",  type: SCALAR,  scalar {  value: 4096.0,  },  role: "*" }, resources { name: "disk",  type: SCALAR,  scalar {  value: 20000.0,  },  role: "*" }, url { scheme: "http",  address {  hostname: "172.17.0.8",   ip: "172.17.0.8",   port: 5051,  },  path: "/slave(1)" }, Reason: Cluster size already fulfilled
[DEBUG] 2016-06-22 12:37:22,042 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 172.17.0.5
[DEBUG] 2016-06-22 12:37:22,047 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_172.17.0.6_20160622T123139.777Z exists, using old state: TASK_RUNNING

"Cluster size already fulfilled" means that the scheduler thinks there is one task running, based on ZK state, even though that executor has been killed.

I am looking into the Mesos code to understand how teardown works at a lower level. We might have to add code to do proper ZK state cleanup on teardown. The question is how to do this and where in the framework. I asked a question on the DC/OS community channel in #general https://dcos-community.slack.com

nandanrao commented 8 years ago

Yes I saw this as well, and ran the docker cleanup script -- which I believe in this case ONLY removed the zookeeper node which was named after the elasticsearch cluster. That SEEMS to have fixed it, although I did not look very closely.
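A rough sketch of that manual cleanup step, assuming (as described above) that the leftover state lives in a znode named after the cluster. The function name `remove_framework_state` and the `/<cluster name>` path layout are assumptions for illustration; `zk` can be any client with kazoo-style `exists()` and `delete()` methods.

```python
def remove_framework_state(zk, cluster_name: str) -> bool:
    """Recursively delete the znode named after the cluster, if present.

    `zk` is any client exposing kazoo-style exists() and delete() methods.
    Returns True if a node was deleted, False if there was nothing to delete.
    """
    # Assumed layout: state is stored under "/<cluster name>".
    path = "/" + cluster_name.strip("/")
    if path == "/":
        raise ValueError("refusing to delete the ZooKeeper root")
    if zk.exists(path) is None:
        return False
    zk.delete(path, recursive=True)
    return True


# Possible usage with the kazoo library (host and cluster name are placeholders):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk-host:2181")
#   zk.start()
#   remove_framework_state(zk, "elasticsearch")
#   zk.stop()
```

Only run something like this after the scheduler and executors are really gone, since the scheduler relies on this state to recover from its own failures.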

philwinder commented 8 years ago

Personally, I always viewed shutdown as of secondary importance. Who wants their ES cluster to be destroyed? ;-)

But seriously, in the past I simply stopped the scheduler. There would be some state left in zookeeper, which is required just in case it failed on its own. You can manually delete that, or just ignore it. There's very little in there.

philwinder commented 8 years ago

Oh, and also check out #550. If the scheduler is closed, then the executors are closed, and then the scheduler starts again, the scheduler will still think that the executors are running, because we never receive any updates from Mesos to tell us that they've gone. An issue with Mesos IMO. But a "ping" mechanism to make sure they are still there would work around it.
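To make the "ping" idea concrete, here is a minimal sketch of the check it implies: compare the task states the scheduler has stored with the tasks that actually answered, and flag the ones that are recorded as running but are no longer there. This is not the framework's code; `find_stale_tasks` and its inputs are hypothetical, and how the set of live task ids is obtained is left open.

```python
from typing import Dict, Iterable, List


def find_stale_tasks(stored_states: Dict[str, str], live_task_ids: Iterable[str]) -> List[str]:
    """Return ids of tasks stored as TASK_RUNNING that did not respond to the ping."""
    live = set(live_task_ids)
    return sorted(
        task_id
        for task_id, state in stored_states.items()
        if state == "TASK_RUNNING" and task_id not in live
    )


if __name__ == "__main__":
    # State left behind in ZK after the teardown (task id taken from the log above).
    stored = {"elasticsearch_172.17.0.6_20160622T123139.777Z": "TASK_RUNNING"}
    # No executor answered the ping, so the stored task is reported as stale.
    print(find_stale_tasks(stored, live_task_ids=[]))
```

A scheduler could drop or relaunch the tasks returned here instead of declining offers with "Cluster size already fulfilled".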